FAULT TOLERANT DESIGN: 
AN INTRODUCTION 


ELENA DUBROVA 

Department of Microelectronics and Information Technology 
Royal Institute of Technology 
Stockholm, Sweden 


Kluwer Academic Publishers 

Boston/Dordrecht/London 



Contents 


Acknowledgments

1. INTRODUCTION
   1 Definition of fault tolerance
   2 Fault tolerance and redundancy
   3 Applications of fault-tolerance

2. FUNDAMENTALS OF DEPENDABILITY
   1 Introduction
   2 Dependability attributes
     2.1 Reliability
     2.2 Availability
     2.3 Safety
   3 Dependability impairments
     3.1 Faults, errors and failures
     3.2 Origins of faults
     3.3 Common-mode faults
     3.4 Hardware faults
       3.4.1 Permanent and transient faults
       3.4.2 Fault models
     3.5 Software faults
   4 Dependability means
     4.1 Fault tolerance
     4.2 Fault prevention
     4.3 Fault removal
     4.4 Fault forecasting
   5 Problems


DRAFT 


March 25, 2008, 2:12am


DRAFT 





3. DEPENDABILITY EVALUATION TECHNIQUES
   1 Introduction
   2 Basics of probability theory
   3 Common measures of dependability
     3.1 Failure rate
     3.2 Mean time to failure
     3.3 Mean time to repair
     3.4 Mean time between failures
     3.5 Fault coverage
   4 Dependability model types
     4.1 Reliability block diagrams
     4.2 Markov processes
       4.2.1 Single-component system
       4.2.2 Two-component system
       4.2.3 State transition diagram simplification
   5 Dependability computation methods
     5.1 Computation using reliability block diagrams
       5.1.1 Reliability computation
       5.1.2 Availability computation
     5.2 Computation using Markov processes
       5.2.1 Reliability evaluation
       5.2.2 Availability evaluation
       5.2.3 Safety evaluation
   6 Problems

4. HARDWARE REDUNDANCY
   1 Introduction
   2 Redundancy allocation
   3 Passive redundancy
     3.1 Triple modular redundancy
       3.1.1 Reliability evaluation
       3.1.2 Voting techniques
     3.2 N-modular redundancy
   4 Active redundancy
     4.1 Duplication with comparison
       4.1.1 Reliability evaluation
     4.2 Standby sparing
       4.2.1 Reliability evaluation





     4.3 Pair-and-a-spare
   5 Hybrid redundancy
     5.1 Self-purging redundancy
       5.1.1 Reliability evaluation
     5.2 N-modular redundancy with spares
     5.3 Triplex-duplex redundancy
   6 Problems

5. INFORMATION REDUNDANCY
   1 Introduction
   2 Fundamental notions
     2.1 Code
     2.2 Encoding
     2.3 Information rate
     2.4 Decoding
     2.5 Hamming distance
     2.6 Code distance
     2.7 Code efficiency
   3 Parity codes
   4 Linear codes
     4.1 Basic notions
     4.2 Definition of linear code
     4.3 Generator matrix
     4.4 Parity check matrix
     4.5 Syndrome
     4.6 Constructing linear codes
     4.7 Hamming codes
     4.8 Extended Hamming codes
   5 Cyclic codes
     5.1 Definition
     5.2 Polynomial manipulation
     5.3 Generator polynomial
     5.4 Parity check polynomial
     5.5 Syndrome polynomial
     5.6 Implementation of polynomial division
     5.7 Separable cyclic codes
     5.8 CRC codes
     5.9 Reed-Solomon codes



   6 Unordered codes
     6.1 M-of-n codes
     6.2 Berger codes
   7 Arithmetic codes
     7.1 AN-codes
     7.2 Residue codes
   8 Problems

6. TIME REDUNDANCY
   1 Introduction
   2 Alternating logic
   3 Recomputing with shifted operands
   4 Recomputing with swapped operands
   5 Recomputing with duplication with comparison
   6 Problems

7. SOFTWARE REDUNDANCY
   1 Introduction
   2 Single-version techniques
     2.1 Fault detection techniques
     2.2 Fault containment techniques
     2.3 Fault recovery techniques
       2.3.1 Exception handling
       2.3.2 Checkpoint and restart
       2.3.3 Process pairs
       2.3.4 Data diversity
   3 Multi-version techniques
     3.1 Recovery blocks
     3.2 N-version programming
     3.3 N self-checking programming
     3.4 Design diversity
   4 Software Testing
     4.1 Statement and Branch Coverage
       4.1.1 Statement Coverage
       4.1.2 Branch Coverage
     4.2 Preliminaries
     4.3 Statement Coverage Using Kernels
     4.4 Computing Minimum Kernels




     4.5 Decision Coverage Using Kernels
   5 Problems






Acknowledgments 


I would like to thank KTH students Ionian Grazhdani, Xavier Lowagie, 
Pieter Nuyts, Henrik Kirkeby, Chen Fu, Kareem Refaat, Sergej Koziner, Julia 
Kuznetsova, and Dr. Roman Morawek from Technikum Wien for reading and
correcting the draft of the manuscript.

I am grateful to the Swedish Foundation for International Cooperation in
Research and Higher Education (STINT) for the scholarship KU2002-4044 which
supported my trip to the University of New South Wales, Sydney, Australia,
where the first draft of this book was written during October - December 2002.






Chapter 1 


INTRODUCTION 


If anything can go wrong, it will. 


—Murphy’s law 


1. Definition of fault tolerance 

Fault tolerance is the ability of a system to continue performing its intended 
function in spite of faults. In a broad sense, fault tolerance is associated with 
reliability, with successful operation, and with the absence of breakdowns. A 
fault-tolerant system should be able to handle faults in individual hardware or 
software components, power failures or other kinds of unexpected disasters and 
still meet its specification. 

Fault tolerance is needed because it is practically impossible to build a
perfect system. The fundamental problem is that, as the complexity of a system
increases, its reliability drastically deteriorates unless compensatory measures
are taken. For example, if the reliability of individual components is 99.99%,
then the reliability of a system consisting of 100 non-redundant components is
about 99.0%, whereas the reliability of a system consisting of 10,000
non-redundant components is just 36.79%. Such a low reliability is unacceptable
in most applications. If a 99% reliability is required for a 10,000-component
system, individual components with a reliability of at least 99.9999% should be
used, implying an increase in cost.
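
The figures in this paragraph can be checked with a short calculation. The sketch below is an illustration only. It assumes that the system works only if every component works and that components fail independently, so the system reliability is the product of the component reliabilities. The function names are hypothetical and are not used elsewhere in this book.

```python
def series_reliability(component_reliability: float, n: int) -> float:
    """Reliability of n identical, independent, non-redundant components.

    The system is assumed to work only if all n components work.
    """
    return component_reliability ** n


def required_component_reliability(system_reliability: float, n: int) -> float:
    """Component reliability needed to reach a target system reliability."""
    return system_reliability ** (1.0 / n)


if __name__ == "__main__":
    for n in (100, 10_000):
        print(f"{n} components: {series_reliability(0.9999, n):.4%}")
    needed = required_component_reliability(0.99, 10_000)
    print(f"needed per component for 99% at 10,000 components: {needed:.6%}")
```

Under these assumptions, the required component reliability grows very quickly with the number of components, which is why adding redundancy is usually preferred to relying on ever more reliable parts.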

Another problem is that, although designers do their best to have all the
hardware defects and software bugs cleaned out of the system before it goes
on the market, history shows that such a goal is not attainable. It is inevitable
that some unexpected environmental factor is not taken into account, or some
potential user mistakes are not foreseen. Thus, even in the unlikely case that a
system is designed and implemented perfectly, faults are likely to be caused by
situations out of the control of the designers.

A system is said to fail if it ceases to perform its intended function. The term
system is used in this book in the generic sense of a group of independent but
interrelated elements comprising a unified whole. Therefore, the techniques
presented are also applicable to a variety of products, devices and subsystems.
A failure can be a total cessation of function, or a performance of some function
in a subnormal quality or quantity, like deterioration or instability of operation.
The aim of fault-tolerant design is to minimize the probability of failures, whether
those failures simply annoy the customers or result in lost fortunes, human
injury or environmental disaster.

exist to increase component reliability. Failure rates in hardware are

2. Fault tolerance and redundancy 

There are various approaches to achieving fault tolerance. Common to all these
approaches is a certain amount of redundancy. For our purposes, redundancy
is the provision of functional capabilities that would be unnecessary in a
fault-free environment. This can be a replicated hardware component, an additional
check bit attached to a string of digital data, or a few lines of program code
verifying the correctness of the program’s results. The idea of incorporating
redundancy in order to improve the reliability of a system was pioneered by John
von Neumann in the early 1950s in his work “Probabilistic logics and the synthesis
of reliable organisms from unreliable components”.

Two kinds of redundancy are possible: space redundancy and time redundancy.
Space redundancy provides additional components, functions, or data
items that are unnecessary for fault-free operation. Space redundancy is further
classified into hardware, software and information redundancy, depending
on the type of redundant resources added to the system. In time redundancy
the computation or data transmission is repeated and the result is compared to
a stored copy of the previous result.
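
As an illustration of time redundancy, the following minimal sketch repeats a computation and compares the second result with the stored first one. The class `GlitchySquarer` and the function `run_with_time_redundancy` are hypothetical names introduced only for this example. The fault is injected artificially on one chosen call to imitate a short-lived disturbance.

```python
from typing import Callable, Optional


def run_with_time_redundancy(compute: Callable[[int], int], x: int) -> int:
    """Run the computation twice and compare the two results."""
    first = compute(x)    # stored copy of the previous result
    second = compute(x)   # repeated computation
    if first != second:
        raise RuntimeError("results disagree: a fault was detected")
    return second


class GlitchySquarer:
    """Hypothetical unit that squares its input but misbehaves on one call."""

    def __init__(self, faulty_call: Optional[int] = None) -> None:
        self.calls = 0
        self.faulty_call = faulty_call  # 1-based index of the corrupted call

    def __call__(self, x: int) -> int:
        self.calls += 1
        result = x * x
        if self.calls == self.faulty_call:
            result ^= 1  # injected fault: flip the lowest bit of the result
        return result


if __name__ == "__main__":
    print(run_with_time_redundancy(GlitchySquarer(), 7))
    try:
        run_with_time_redundancy(GlitchySquarer(faulty_call=2), 7)
    except RuntimeError as exc:
        print("detected:", exc)
```

Plain repetition of this kind only reveals faults that affect the two runs differently. A fault that corrupts both runs in the same way produces two equal, wrong results and goes unnoticed.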

3. Applications of fault-tolerance 

Originally, fault-tolerance techniques were used to cope with physical defects
of individual hardware components. Designers of early computing systems
employed redundant structures with voting to eliminate the effect of failed
components, error-detecting or error-correcting codes to detect or correct
information errors, diagnostic techniques to locate failed components and
automatic switch-overs to replace them.

Following the development of semiconductor technology, hardware components
became intrinsically more reliable and the need for tolerance of component
defects diminished in general-purpose applications. Nevertheless, fault tolerance
remained necessary in many safety-, mission- and business-critical applications.
Safety-critical applications are those where loss of life or environmental
disaster must be avoided. Examples are nuclear power plant control systems,
computer-controlled radiation therapy machines, heart pacemakers and military
radar systems. Mission-critical applications stress mission completion, as in
the case of an airplane or a spacecraft. Business-critical applications are those
in which keeping a business operating is an issue. Examples are bank and stock
exchange automated trading systems, web servers and e-commerce.

As the complexity of systems grew, a need arose to tolerate faults other than
hardware component faults. The rapid development of real-time computing
applications that started around the mid-1990s, especially the demand for
software-embedded intelligent devices, made software fault tolerance a pressing issue.
Software systems offer compact design, rich functionality and competitive cost.
Instead of implementing a given functionality in hardware, the design is done by
writing a set of instructions accomplishing the desired tasks and loading them
into a processor. If changes in the functionality are needed, the instructions can
be modified instead of building a different physical device.

An inevitable related problem is that the design of a system is performed
by someone who is not an expert in that system. For example, the autopilot
expert decides how the device should work, and then provides the information
to a software engineer, who implements the design. This extra communication
step is the source of many faults in software today. The software is doing what
the software engineer thought it should do, rather than what the original design
engineer required. Nearly all the serious accidents in which software has been
involved in the past can be traced to this origin.


Chapter 2


FUNDAMENTALS OF DEPENDABILITY 


Ah, this is obviously some strange usage of the word ‘safe’ that I wasn’t
previously aware of.

—Douglas Adams, "The Hitchhikers Guide to the Galaxy". 


1. Introduction 

The ultimate goal of fault tolerance is the development of a dependable
system. In broad terms, dependability is the ability of a system to deliver its
intended level of service to its users. As computer systems become relied upon
by society more and more, the dependability of these systems becomes a critical
issue. In airplanes, chemical plants, heart pacemakers or other safety-critical
applications, a system failure can cost lives or cause an environmental disaster.

In this section, we study three fundamental characteristics of dependability:
attributes, impairments and means. Dependability attributes describe the
properties which are required from a system. Dependability impairments express
the reasons for a system to cease to perform its function or, in other words, the
threats to dependability. Dependability means are the methods and techniques
enabling the development of a dependable computing system.

2. Dependability attributes 

The attributes of dependability express the properties which are expected
from a system. Three primary attributes are reliability, availability and safety.
Other possible attributes include maintainability, testability, performability,
confidentiality and security. Depending on the application, one or more of these
attributes are needed to appropriately evaluate the system behavior. For example,
in an automatic teller machine (ATM), the proportion of time during which the
system is able to deliver its intended level of service (system availability) is an
important measure. For a cardiac patient with a pacemaker, continuous functioning
of the device is a matter of life and death. Thus, the ability of the system to deliver
its service without interruption (system reliability) is crucial. In a nuclear power
plant control system, the ability of the system to perform its functions correctly
or to discontinue its function in a safe manner (system safety) is of greater
importance.

2.1 Reliability 

Reliability R(t) of a system at time t is the probability that the system
operates without failure in the interval [0, t], given that the system was performing
correctly at time 0.

Reliability is a measure of the continuous delivery of correct service. High 
reliability is required in situations when a system is expected to operate without 
interruptions, as in the case of a pacemaker, or when maintenance cannot be 
performed because the system cannot be accessed. For example, a spacecraft
mission control system is expected to provide uninterrupted service. A flaw
in the system is likely to cause the destruction of the spacecraft, as in the case
of NASA’s earth-orbiting Lewis spacecraft launched on August 23rd, 1997.
The spacecraft entered a flat spin in orbit that resulted in a loss of solar power 
and a fatal battery discharge. Contact with the spacecraft was lost, and it then 
re-entered the atmosphere and was destroyed on September 28th. According 
to the report of the Lewis Spacecraft Mission Failure Investigation, the failure 
was due to a combination of a technically flawed attitude-control system design 
and inadequate monitoring of the spacecraft during its crucial early operations 
phase. 

Reliability is a function of time. The way in which time is specified varies 
considerably depending on the nature of the system under consideration. For 
example, if a system is expected to complete its mission in a certain period of
time, as in the case of a spacecraft, time is likely to be defined as calendar time
or as a number of hours. For software, the time interval is often specified in
so-called natural units or time units. A natural unit is a unit related to the amount
of processing performed by a software-based product, such as pages of output, 
transactions, telephone calls, jobs or queries. 

2.2 Availability 

Relatively few systems are designed to operate continuously without
interruption and without maintenance of any kind. In many cases, we are interested
not only in the probability of failure, but also in the number of failures and,
in particular, in the time required to make repairs. For such applications, the
attribute which we would like to maximize is the fraction of time that the system
is in the operational state, expressed by availability.

Availability A(t) of a system at time t is the probability that the system is
functioning correctly at the instant of time t.

A(t) is also referred to as point availability, or instantaneous availability. Often
it is necessary to determine the interval or mission availability. It is defined by

$$A(T) = \frac{1}{T} \int_0^T A(t)\,dt. \tag{2.1}$$

A(T) is the value of the point availability averaged over some interval of time
T. This interval might be the life-time of a system or the time to accomplish
some particular task. Finally, it is often found that after some initial transient
effect, the point availability assumes a time-independent value. In this case, the
steady-state availability is defined by

$$A(\infty) = \lim_{T \to \infty} \frac{1}{T} \int_0^T A(t)\,dt. \tag{2.2}$$

If a system cannot be repaired, the point availability A(t) is equal to the
system’s reliability, i.e. the probability that the system has not failed between 0
and t. Thus, as T goes to infinity, the steady-state availability of a non-repairable
system goes to zero:

$$A(\infty) = 0.$$
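
The averaging in equations (2.1) and (2.2) can be tried out numerically. The sketch below is only an illustration. It assumes a non-repairable system whose reliability decays exponentially, A(t) = R(t) = exp(-λt), with an arbitrarily chosen rate λ = 0.001 per hour; neither the exponential form nor the value of λ comes from this section. The integral is approximated with the trapezoidal rule.

```python
import math
from typing import Callable


def interval_availability(point_availability: Callable[[float], float],
                          T: float, steps: int = 10_000) -> float:
    """Approximate (1/T) * integral from 0 to T of A(t) dt (trapezoidal rule)."""
    h = T / steps
    total = 0.5 * (point_availability(0.0) + point_availability(T))
    for i in range(1, steps):
        total += point_availability(i * h)
    return total * h / T


FAILURE_RATE = 0.001  # assumed value, failures per hour, for illustration only


def reliability(t: float) -> float:
    """Assumed point availability of a non-repairable system."""
    return math.exp(-FAILURE_RATE * t)


if __name__ == "__main__":
    for T in (100.0, 1_000.0, 10_000.0, 100_000.0):
        print(f"T = {T:>9.0f} h  interval availability = "
              f"{interval_availability(reliability, T):.4f}")
```

For this assumed A(t) the integral also has the closed form (1 - exp(-λT)) / (λT), which tends to zero as T grows, in agreement with the statement that the steady-state availability of a non-repairable system is zero.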

Steady-state availability is often specified in terms of downtime per year.
Table 2.1 shows the values for the availability and the corresponding downtime.


Availability    Downtime
------------    ---------------
90%             36.5 days/year
99%             3.65 days/year
99.9%           8.76 hours/year
99.99%          52 minutes/year
99.999%         5 minutes/year
99.9999%        31 seconds/year



Table 2.1. Availability and the corresponding downtime per year. 
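
The entries of Table 2.1 follow directly from the fraction of a year during which the system is not available. The short sketch below reproduces them, assuming a 365-day year; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in seconds, for a steady-state availability."""
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for a in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        s = downtime_seconds_per_year(a)
        print(f"{a}%: {s / 86400:.2f} days, {s / 3600:.2f} hours, "
              f"{s / 60:.1f} minutes, {s:.0f} seconds")
```

The table rounds the smaller values down to whole minutes or seconds.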


Availability is typically used as a measure for systems where short
interruptions can be tolerated. Networked systems, such as telephone switching and
web servers, fall into this category. A customer of a telephone system expects to
complete a call without interruptions. However, a downtime of three minutes
a year is considered acceptable. Surveys show that web users lose patience
when web sites take longer than eight seconds to show results. This means
that such web sites should be available all the time and should respond quickly
even when a large number of clients concurrently access them. Another
example is the electrical power control system. Customers expect power to be
available 24 hours a day, every day, in any weather condition. In some cases, a
prolonged power failure may lead to health hazards, due to the loss of services
such as water pumps, heating, light, or medical attention. Industries may suffer
substantial financial loss.

2.3 Safety 

Safety can be considered as an extension of reliability, namely a reliability 
with respect to failures that may create safety hazards. From the reliability point 
of view, all failures are equal. On the other hand, for safety considerations, 
failures are partitioned into fail-safe and fail-unsafe ones. 

As an example consider an alarm system. The alarm may either fail to 
function even though a dangerous situation exists, or it may give a false alarm 
when no danger is present. The former is classified as a fail-unsafe failure. The 
latter is considered a fail-safe one. More formally, safety is defined as follows. 

Safety S(t) of a system is the probability that the system will either perform
its function correctly or will discontinue its operation in a fail-safe manner.

Safety is required in safety-critical applications where a failure may result in
human injury, loss of life or environmental disaster. Examples are chemical
or nuclear power plant control systems, aerospace and military applications.

Many unsafe failures are caused by human mistakes. For example, the
Chernobyl accident on April 26th, 1986, happened because all safety systems were
shut off to allow an experiment which aimed at investigating the possibility of
producing electricity from the residual energy in the turbo-generators. The
experiment was badly planned, and was led by an electrical engineer who was not
familiar with the reactor facility. The experiment could not be canceled when
things went wrong, because all automatic shutdown systems and the emergency
core cooling system of the reactor had been manually turned off.

3. Dependability impairments 

Dependability impairments are usually defined in terms of faults, errors and
failures. A common feature of the three terms is that they give us a message that
something went wrong. A difference is that, in case of a fault, the problem
occurred on the physical level; in case of an error, the problem occurred on
the computational level; in case of a failure, the problem occurred on a system
level.

3.1 Faults, errors and failures

A fault is a physical defect, imperfection, or flaw that occurs in some
hardware or software component. Examples are a short circuit between two adjacent
interconnects, a broken pin, or a software bug.

An error is a deviation from correctness or accuracy in computation, which
occurs as a result of a fault. Errors are usually associated with incorrect values
in the system state. For example, a circuit or a program computes an incorrect
value, or incorrect information is received while transmitting data.

A failure is a non-performance of some action which is due or expected. A
system is said to have a failure if the service it delivers to the user deviates from
compliance with the system specification for a specified period of time. A
system may fail either because it does not act in accordance with the specification,
or because the specification did not adequately describe its function.

Faults are reasons for errors and errors are reasons for failures. For example,
consider a power plant, in which a computer-controlled system is responsible
for monitoring various plant temperatures, pressures, and other physical
characteristics. The sensor reporting the speed at which the main turbine is spinning
breaks. This fault causes the system to send more steam to the turbine than
is required (error), over-speeding the turbine, and resulting in the mechanical
safety system shutting down the turbine to prevent damaging it. The system is
no longer generating power (system failure, fail-safe).

The definitions of physical, computational and system level are a bit more
confusing when applied to software. In the context of this book, we interpret a
program code as physical level, the values of a program state as computational
level, and the software system running the program as system level. For
example, an operating system is a software system. Then, a bug in a program is a
fault, a possible incorrect value caused by this bug is an error and a possible crash
of the operating system is a failure.

Not every fault causes an error and not every error causes a failure. This is
particularly evident in the software case. Some program bugs are very hard to
find because they cause failures only in very specific situations. For example,
in November 1985, a $32 billion overdraft was experienced by the Bank of New
York, leading to a loss of $5 million in interest. The failure was caused by an
unchecked overflow of a 16-bit counter. In 1994, the Intel Pentium I
microprocessor was discovered to compute incorrect answers to certain floating-point
division calculations. For example, dividing 5505001 by 294911 produced
18.66600093 instead of 18.66665197. The problem occurred because of
the omission of five entries in a table of 1066 values used by the division
algorithm. The five cells should have contained the constant +2, but because the
cells were empty, the processor treated them as a zero.
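
The 16-bit counter case shows how a fault can stay dormant for a long time. The sketch below is a generic illustration and not a reconstruction of the bank’s software: it models a counter limited to 16 bits, whose unchecked version silently wraps to zero after 65535, next to a checked version that reports the problem. All names are hypothetical.

```python
COUNTER_BITS = 16
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 0xFFFF, the largest 16-bit value


def increment_unchecked(counter: int) -> int:
    """Increment as 16-bit hardware would: the carry out of bit 15 is lost."""
    return (counter + 1) & COUNTER_MASK


def increment_checked(counter: int) -> int:
    """Increment, but refuse to wrap around silently."""
    if counter == COUNTER_MASK:
        raise OverflowError("16-bit counter would overflow")
    return counter + 1


if __name__ == "__main__":
    counter = 0
    for _ in range(65_536):          # one more increment than the counter can hold
        counter = increment_unchecked(counter)
    print("unchecked counter after 65536 increments:", counter)
```

The fault (the missing check) is present from the start, but it produces an error only on the one increment that crosses the 16-bit limit.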

3.2 Origins of faults

As we discussed earlier, failures are caused by errors and errors are caused
by faults. Faults are, in turn, caused by numerous problems occurring at the
specification, implementation and fabrication stages of the design process. They can
also be caused by external factors, such as environmental disturbances or human
actions, either accidental or deliberate. Broadly, we can classify the sources
of faults into four groups: incorrect specification, incorrect implementation,
fabrication defects and external factors.

Incorrect specification results from incorrect algorithms, architectures, or
requirements. A typical example is a case when the specification requirements
ignore aspects of the environment in which the system operates. The system
might function correctly most of the time, but there also could be instances of
incorrect performance. Faults caused by incorrect specifications are usually called
specification faults. In System-on-a-Chip design, which integrates pre-designed
intellectual property (IP) cores, specification faults are one of the most common
types of faults. Core specifications, provided by the core vendors, do not always
contain all the details that system-on-a-chip designers need. This is partly due
to the intellectual property protection requirements, especially for core netlists
and layouts.

Faults due to incorrect implementation, usually referred to as design faults, 
occur when the system implementation does not adequately implement the 
specification. In hardware, these include poor component selection, logical 
mistakes, poor timing or synchronization. In software, examples of incorrect 
implementation are bugs in the program code and poor software component 
reuse. Software heavily relies on different assumptions about its operating 
environment. Faults are likely to occur if these assumptions are incorrect in 
the new environment. The Ariane 5 rocket accident is an example of a failure 
caused by a reused software component. The Ariane 5 rocket exploded 37 seconds
after lift-off on June 4th, 1996, because of a software fault that resulted from 
converting a 64-bit floating point number to a 16-bit integer. The value of the 
floating point number happened to be larger than the one that can be represented 
by a 16-bit integer. In response to the overflow, the computer cleared its memory. 
The memory dump was interpreted by the rocket as an instruction to its rocket 
nozzles, which caused an explosion. 
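
The range problem behind such a conversion can be shown in a few lines. The sketch below is not a reconstruction of the Ariane flight software. It only contrasts an unchecked conversion of a floating point value to a 16-bit signed integer, modeled here as keeping the low 16 bits (one possible behaviour of an unprotected conversion, assumed for illustration), with a conversion that checks the range first. The function names are hypothetical.

```python
INT16_MIN = -(2 ** 15)     # -32768
INT16_MAX = 2 ** 15 - 1    #  32767


def to_int16_unchecked(value: float) -> int:
    """Truncate toward zero and keep only the low 16 bits (two's complement)."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, raising an error if the result does not fit."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit into a 16-bit signed integer")
    return truncated


if __name__ == "__main__":
    for v in (1234.9, 32767.0, 40000.0):
        print(v, "->", to_int16_unchecked(v))
```

A value that was always in range in the environment for which the code was written can fall outside the range in a new environment, which is why assumptions of this kind need to be rechecked when software is reused.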

Component defects are a source of faults in hardware. These include
manufacturing imperfections, random device defects and component wear-out.
Fabrication defects were the primary reason for applying fault-tolerance
techniques to early computing systems, due to the low reliability of components.
Following the development of semiconductor technology, hardware components
became intrinsically more reliable and the percentage of faults caused by
fabrication defects diminished.

The fourth cause of faults is external factors, which arise from outside the
system boundary: the environment, the user or the operator. External factors
include phenomena that directly affect the operation of the system, such as
temperature, vibration, electrostatic discharge, nuclear or electromagnetic radiation,
or that affect the inputs provided to the system. For instance, radiation causing
a bit to flip in a memory location is a fault caused by an external factor. Faults
caused by user or operator mistakes can be accidental or malicious. For
example, a user can accidentally provide incorrect commands to a system that can
lead to system failure, e.g. improperly initialized variables in software.
Malicious faults are the ones caused, for example, by software viruses and hacker
intrusions.

3.3 Common-mode faults 

A common-mode fault is a fault which occurs simultaneously in two or more
redundant components. Common-mode faults are caused by phenomena that
create dependencies between the redundant units which cause them to fail
simultaneously, e.g. common communication buses or shared environmental factors.
Systems are vulnerable to common-mode faults if they rely on a single source
of power, cooling or input/output (I/O) bus.

Another possible source of common-mode faults is a design fault which
causes redundant copies of hardware or of the same software process to fail
under identical conditions. The only fault-tolerance approach for combating
common-mode design faults is design diversity. Design diversity is the
implementation of more than one variant of the function to be performed. For
computer-based applications, it is shown to be more efficient to vary a design at
higher levels of abstraction. For example, varying algorithms is more efficient
than varying implementation details of a design, e.g. using different programming
languages. Since diverse designs must implement a common system
specification, the possibility for dependency always arises in the process of refining the
specification. Truly diverse designs eliminate dependencies by using separate
design teams, different design rules and software tools.

3.4 Hardware faults 

In this section we first consider two major classes of hardware faults:
permanent and transient faults. Then, we show how different types of hardware
faults can be modeled.

3.4.1 Permanent and transient faults 

Hardware faults are classified with respect to fault duration into permanent,
transient and intermittent faults.

A permanent fault remains active until a corrective action is taken. These
faults are usually caused by some physical defects in the hardware, such as shorts
in a circuit, broken interconnections or stuck bits in the memory. Permanent
faults can be detected by on-line test routines that work concurrently with the
normal system operation.

A transient fault remains active for a short period of time. A transient fault
that becomes active periodically is an intermittent fault. Because of their short
duration, transient faults are often detected through the errors that result from
their propagation. Transient faults are often called soft faults or glitches.
Transient faults are the dominant type of faults in computer memories. For example,
about 98% of RAM faults are transient faults. The causes of transient faults are
mostly environmental, such as alpha particles, cosmic rays, electrostatic
discharge, electrical power drops, overheating or mechanical shock. For instance,
a voltage spike might cause a sensor to report an incorrect value for a few
milliseconds before reporting correctly. Studies show that a typical computer
experiences more than 120 power problems per month. Cosmic rays cause the
failure rate of electronics at airplane altitudes to be approximately one hundred
times greater than at sea level. Intermittent faults can be due to implementation
flaws, aging and wear-out, and to unexpected operating conditions. For
example, a loose solder joint in combination with vibration can cause an intermittent
fault.

3.4.2 Fault models 

It is not possible to enumerate all possible types of faults which can occur 
in a system. To make the evaluation of fault coverage possible, faults are 
assumed to behave according to some fault model. Some of the commonly 
used fault models are: stuck-at fault, transition fault, coupling fault. A fault 
model attempts to describe the effect of the fault that can occur. 

A stuck-at fault is a fault which results in a line in the circuit or a memory 
cell being permanently stuck at a logic one or zero. It is assumed that the basic 
functionality of the circuit is not changed by the fault, i.e. a combinational 
circuit is not transformed to a sequential circuit, or an AND gate does not 
become an OR gate. Due to its simplicity and effectiveness, the stuck-at fault is
the most common fault model.

A transition fault is a fault in which a line in the circuit or a memory cell
cannot change from a particular state to another state. For example, suppose
a memory cell contains a value zero. If a one is written to the cell, the cell 
successfully changes its state. However, a subsequent write of a zero to the cell 
does not change the state of the cell. The memory is said to have a one-to-zero 
transition fault. Both stuck-at faults and transition faults can be easily detected 
during testing.

Coupling faults are more difficult to test because they depend upon more than
one line. An example of a coupling fault would be a short-circuit between two
adjacent word lines in a memory. Writing a value to a memory cell connected
to one of the word lines would also result in that value being written to the
corresponding memory cell connected to the other short-circuited word line.
Two types of transition coupling faults are inversion coupling faults, in
which a specific transition in one memory cell inverts the contents of another
memory cell, and idempotent coupling faults, in which a specific transition of
one memory cell results in a particular value (0 or 1) being written to another
memory cell.

Clearly, fault models are not accurate in 100% of cases, because faults can cause
a variety of different effects. However, studies have shown that a combination
of several fault models can give a very precise coverage of actual faults. For
example, for memories, practically all faults can be modeled as a combination
of stuck-at faults, transition faults and idempotent coupling faults.
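The three memory fault models can be made concrete with a small simulation. The
following Python sketch is an illustration only: the class FaultyMemory, its
parameters and the two test routines are hypothetical names introduced here, and
the memory is simplified to one bit per address. It suggests why stuck-at and
transition faults are easy to detect with a per-cell write/read test, while an
idempotent coupling fault can escape such a test unless other cells are checked
as well.

```python
class FaultyMemory:
    """One-bit-per-address memory with injected faults (illustrative model only)."""

    def __init__(self, size, stuck_at=None, no_transition=None, coupling=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})            # {address: stuck value}
        self.no_transition = dict(no_transition or {})  # {address: (old, new)} that fails
        self.coupling = dict(coupling or {})            # {(aggressor, old, new): (victim, value)}
        for addr, value in self.stuck_at.items():
            self.cells[addr] = value

    def write(self, addr, value):
        old = self.cells[addr]
        if addr in self.stuck_at:
            return                                      # stuck-at cell ignores writes
        if self.no_transition.get(addr) == (old, value):
            return                                      # this transition cannot happen
        self.cells[addr] = value
        if old != value and (addr, old, value) in self.coupling:
            victim, forced = self.coupling[(addr, old, value)]
            self.cells[victim] = forced                 # idempotent coupling fault

    def read(self, addr):
        return self.cells[addr]


def per_cell_test(mem, size):
    """Write 0, 1, 0 to each cell in turn and read it back after every write."""
    for addr in range(size):
        for value in (0, 1, 0):
            mem.write(addr, value)
            if mem.read(addr) != value:
                return False
    return True


def neighbor_aware_test(mem, size):
    """Clear the memory, then toggle each cell and check every cell after each write."""
    for addr in range(size):
        mem.write(addr, 0)
    for addr in range(size):
        for value in (1, 0):
            mem.write(addr, value)
            for other in range(size):
                expected = value if other == addr else 0
                if mem.read(other) != expected:
                    return False
    return True


if __name__ == "__main__":
    faults = {
        "stuck-at-1 at cell 2": dict(stuck_at={2: 1}),
        "one-to-zero transition fault at cell 1": dict(no_transition={1: (1, 0)}),
        "coupling: rising cell 0 forces cell 3 to 1": dict(coupling={(0, 0, 1): (3, 1)}),
    }
    for name, kwargs in faults.items():
        simple = per_cell_test(FaultyMemory(4, **kwargs), 4)
        thorough = neighbor_aware_test(FaultyMemory(4, **kwargs), 4)
        print(f"{name}: per-cell passes={simple}, neighbor-aware passes={thorough}")
```

A test "passes" when it observes no error, so a passing test on a faulty memory
means the fault was missed. The second routine reads the whole memory after every
write, which is far more expensive; it is shown only to illustrate the extra
effort that multi-cell faults demand.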

3.5 Software faults 

Software differs from hardware in several aspects. First, software does not
age or wear out. Unlike mechanical or electronic parts of hardware, software
cannot be deformed, broken or affected by environmental factors. Assuming
that software is deterministic, it will always behave the same way in the same
circumstances, unless there are problems in hardware that change the storage
content or data path. Since the software does not change once it is uploaded
into memory and starts running, trying to achieve fault tolerance by simply
replicating the same software modules will not work, because all copies will
have identical faults.

Second, software may undergo several upgrades during the system life cycle.
These can be either reliability upgrades or feature upgrades. A reliability
upgrade aims to enhance software reliability or security. This is usually done
by re-designing or re-implementing some modules using better engineering
approaches. A feature upgrade aims to enhance the functionality of the software.
It is likely to increase the complexity and thus decrease the reliability by
possibly introducing additional faults into the software.

Third, fixing bugs does not necessarily make the software more reliable. On
the contrary, new unexpected problems may arise. For example, in 1991, a
change of three lines of code in a signaling program containing millions of
lines of code caused the local telephone systems in California and along the
Eastern coast to stop.

Finally, since software is inherently more complex and less regular than hardware,
achieving sufficient verification coverage is more difficult. Traditional
testing and debugging methods are inadequate for large software systems. The
recent focus on formal methods promises higher coverage; however, due to their
extremely large computational complexity they are only applicable in specific
applications. Due to incomplete verification, most software faults are
design faults, occurring when a programmer either misunderstands the specification
or simply makes a mistake. Design faults are related to fuzzy human
factors, and therefore they are harder to prevent. In hardware, design faults
may also exist, but other types of faults, such as fabrication defects and transient
faults caused by environmental factors, usually dominate.

4. Dependability means 

Dependability means are the methods and techniques enabling the development
of a dependable system. Fault tolerance, which is the subject of this
book, is one such method. It is normally used in combination with other
methods to attain dependability, such as fault prevention, fault removal and fault
forecasting. Fault prevention aims to prevent the occurrence or introduction
of faults. Fault removal aims to reduce the number of faults which are present
in the system. Fault forecasting aims to estimate how many faults are present,
possible future occurrences of faults, and the impact of the faults on the system.

4.1 Fault tolerance 

Fault tolerance targets the development of systems which function correctly
in the presence of faults. Fault tolerance is achieved by using some kind of
redundancy. In the context of this book, redundancy is the provision of functional
capabilities that would be unnecessary in a fault-free environment. Redundancy
allows a fault either to be masked, or to be detected, with subsequent
location, containment and recovery.

Fault masking is the process of ensuring that only correct values get passed to
the system output in spite of the presence of a fault. This is done by preventing
the system from being affected by errors by either correcting the error, or
compensating for it in some fashion. Since the system does not show the impact of
the fault, the existence of the fault is invisible to the user/operator. For
example, a memory protected by an error-correcting code corrects the faulty
bits before the system uses the data. Another example of fault masking is triple
modular redundancy with majority voting.
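As a rough illustration of masking by triple modular redundancy, the Python sketch
below runs three replicas of a computation and passes on the value that a strict
majority agrees on. The function names and the example modules are hypothetical;
a real voter is usually implemented in hardware and is itself a potential point
of failure, which this sketch ignores.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority, in which case
    the fault cannot be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def tmr(module_a, module_b, module_c, x):
    """Run three replicas on the same input and vote on their outputs."""
    return majority_vote([module_a(x), module_b(x), module_c(x)])


def good(x):
    # Fault-free module (hypothetical example function).
    return x * x


def faulty(x):
    # Module with an injected fault: its result is off by one.
    return x * x + 1


if __name__ == "__main__":
    # A single faulty module is outvoted, so the user sees the correct value.
    print(tmr(good, faulty, good, 3))
```

Note that the vote only masks a fault while a majority of modules is correct. Two
modules failing in the same way would outvote the correct one.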

Fault detection is the process of determining that a fault has occurred within 
a system. Examples of techniques for fault detection are acceptance tests and 
comparison. Acceptance tests are common in processors. The result of a 
program is subjected to a test. If the result passes the test, the program continues 
execution. A failed acceptance test implies a fault. Comparison is used for 
systems with duplicated components. A disagreement in the results indicates 
the presence of a fault.

Fault location is the process of determining where a fault has occurred. A
failed acceptance test cannot generally be used to locate a fault. It can only tell
that something has gone wrong. Similarly, when a disagreement occurs during
the comparison of two modules, it is not possible to tell which of the two has
failed.
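The difference between detecting and locating a fault can be sketched in a few
lines of Python. Both functions below are hypothetical illustrations: the first
applies an acceptance test to a square-root routine, the second compares two
replicas. In each case the check reports only that something went wrong, not
which component is responsible.

```python
import math


def sqrt_with_acceptance_test(compute_sqrt, x, tol=1e-9):
    """Run compute_sqrt(x) and accept the result only if squaring it gives x back."""
    y = compute_sqrt(x)
    passed = y >= 0 and abs(y * y - x) <= tol * max(1.0, x)
    return y, passed


def duplex_compare(module_1, module_2, x):
    """Run two replicas; return (result, fault_detected).

    On disagreement no result is released, and nothing indicates which
    of the two replicas produced the wrong value.
    """
    r1, r2 = module_1(x), module_2(x)
    if r1 != r2:
        return None, True
    return r1, False


if __name__ == "__main__":
    broken_sqrt = lambda x: x / 2          # injected fault, for illustration
    print(sqrt_with_acceptance_test(math.sqrt, 9.0))
    print(sqrt_with_acceptance_test(broken_sqrt, 9.0))
    print(duplex_compare(math.sqrt, broken_sqrt, 9.0))
```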

Fault containment is the process of isolating a fault and preventing the
propagation of the effect of that fault throughout the system. The purpose is to
limit the spread of the effects of a fault from one area of the system into
another area. This is typically achieved by frequent fault detection, by multiple
request/confirmation protocols and by performing consistency checks between
modules.

Once a faulty component has been identified, a system recovers by reconfiguring
itself to isolate the component from the rest of the system and regain
operational status. This might be accomplished by having the component replaced,
or by marking it off-line and using a redundant system. Alternatively, the
system could switch it off and continue operation with a degraded capability.
This is known as graceful degradation.

4.2 Fault prevention 

Fault prevention is achieved by quality control techniques during specification,
implementation and fabrication stages of the design process. For hardware,
this includes design reviews, component screening and testing. For software,
this includes structural programming, modularization and formal verification
techniques.

A rigorous design review may eliminate many of the specification faults. If
a design is efficiently tested, many of its faults and component defects can be
avoided. Faults introduced by external disturbances such as lightning or
radiation are prevented by shielding, radiation hardening, etc. User and operation
faults are avoided by training and regular procedures for maintenance.
Deliberate malicious faults caused by viruses or hackers are reduced by firewalls or
similar security means.

4.3 Fault removal 

Fault removal is performed during the development phase as well as during
the operational life of a system. During the development phase, fault removal
consists of three steps: verification, diagnosis and correction. Fault removal
during the operational life of the system consists of corrective and preventive
maintenance.

Verification is the process of checking whether the system meets a set of given
conditions. If it does not, the other two steps follow: the fault that prevents the
conditions from being fulfilled is diagnosed and the necessary corrections are
performed.

In preventive maintenance, parts are replaced, or adjustments are made before 
failure occurs. The objective is to increase the dependability of the system over 
the long term by staving off the aging effects of wear-out. In contrast, corrective 
maintenance is performed after the failure has occurred in order to return the 
system to service as soon as possible. 

4.4 Fault forecasting 

Fault forecasting is done by performing an evaluation of the system behavior
with respect to fault occurrences or activation. The evaluation can be qualitative,
aiming to rank the failure modes or event combinations that lead to system
failure, or quantitative, aiming to evaluate in terms of probabilities the extent
to which some attributes of dependability are satisfied, or to estimate coverage.
Informally, coverage is the probability that the system does not fail given that
a fault occurs. Simplistic estimates of coverage merely measure redundancy by
accounting for the number of redundant success paths in a system. More
sophisticated estimates of coverage account for the fact that each fault
potentially alters a system’s ability to resist further faults. We study
qualitative and quantitative evaluation techniques in more detail in the next
chapter.

5. Problems 

2.1. What is the primary goal of fault tolerance? 

2.2. Give three examples of applications in which a system failure can cost
people’s lives or cause an environmental disaster.

2.3. What is dependability of a system? Why is the dependability of computer
systems a critical issue nowadays?

2.4. Describe three fundamental characteristics of dependability.

2.5. What do the attributes of dependability express? Why are different attributes
used in different applications?

2.6. Define the reliability of a system. What property of a system does reliability
characterize? In which situations is high reliability required?

2.7. Define point, interval and steady-state availabilities of a system. Which
attribute would we like to maximize in applications requiring high
availability?

2.8. What is the difference between the reliability and the availability of a system?
How does the point availability compare to the system’s reliability if the
system cannot be repaired? What is the steady-state availability of a
non-repairable system?

2.9. Compute the downtime per year for $A(\infty)$ = 80%, 75% and 50%.

2.10. A telephone system has less than 3 min per year downtime. What is its
steady-state availability?

2.11. Define the safety of a system. Into which two groups are the failures
partitioned for safety analysis? Give examples of applications requiring high
safety.

2.12. What are dependability impairments?

2.13. Explain the differences between faults, errors and failures and the
relationships between them.

2.14. Describe the four major groups of fault sources. Give an example for each
group. In your opinion, which of the groups causes the “most expensive” faults?

2.15. What is a common-mode fault? By what kind of phenomena are common-mode
faults caused? Which systems are most vulnerable to common-mode
faults? Give examples.

2.16. How are hardware faults classified with respect to fault duration? Give an
example for each type of fault.

2.17. Why are fault models introduced? Can fault models guarantee 100%
accuracy?

2.18. Give an example of a combinational logic circuit in which a single stuck-at
fault on a given line never causes an error on the output.

2.19. Suppose that we modify the stuck-at fault model in the following way.
Instead of having a line being permanently stuck at a logic one or zero
value, we have a transistor being permanently open or closed. Draw a
transistor-level circuit diagram of a CMOS NAND gate.

(a) Give an example of a fault in your circuit which can be modeled by the
new model but cannot be modeled by the standard stuck-at fault model.

(b) Find a fault in your circuit which cannot be modeled by the new model
but can be modeled by the standard stuck-at fault model.

2.20. Explain the main differences between software and hardware faults.

2.21. What are dependability means? What are the primary goals of fault
prevention, fault removal and fault forecasting?

2.22. What is redundancy? Is redundancy necessary for fault-tolerance? Will any
redundant system be fault-tolerant?

2.23. Does a fault need to be detected to be masked?

2.24. Define fault containment. Explain why fault containment is important.

2.25. Define graceful degradation. Give an example of an application where
graceful degradation is desirable.

2.26. How is fault prevention achieved? Give examples for hardware and for
software.

2.27. During which phases of a system’s life is fault removal performed?

2.28. What types of faults are targeted by verification?

2.29. What are the objectives of preventive and corrective maintenance?

2.30. Consider the logic circuit shown on p. 108, Fig. 6.2 (full adder). Ignore the
s-a-1 fault shown in the picture, i.e. the circuit you analyze does not have
this fault.

(a) Find a test for a stuck-at-1 fault on the input b.

(b) Find a test for a stuck-at-0 fault on the fan-out branch of the input a which
feeds into an AND gate (lower input of the AND gate whose output is
marked "s-a-1" in the picture).





Chapter 3 


DEPENDABILITY EVALUATION TECHNIQUES 


A common mistake that people make when trying to design something completely foolproof 
is to underestimate the ingenuity of complete fools. 

—Douglas Adams, Mostly Harmless 


1. Introduction 

Along with cost and performance, dependability is the third critical criterion
on which system-related decisions are based. Dependability evaluation is
important because it helps identify which aspects of the system behavior, e.g.
component reliability, fault coverage or maintenance strategy, play a critical
role in determining overall system dependability. Thus, it provides a proper
focus for product improvement effort from early in the development stage to
fabrication and test.

There are two conventional approaches to dependability evaluation: (1) modeling
of a system in the design phase, or (2) assessment of the system in a later
phase, typically by test. The first approach relies on probabilistic models that
use component level failure rates published in handbooks or supplied by the
manufacturers. This approach provides an early indication of system
dependability, but the model as well as the underlying data later need to be validated
by actual measurements. The second approach typically uses test data and
reliability growth models. It involves fewer assumptions than the first, but it can
be very costly. The higher the dependability required for a system, the longer
the test. A further difficulty arises in the translation of reliability data obtained
by test into those applicable to the operational environment.

Dependability evaluation has two aspects. The first is qualitative evaluation,
which aims to identify, classify and rank the failure modes, or the event
combinations that would lead to system failures. For example, component faults or
environmental conditions are analyzed. The second aspect is quantitative
evaluation, which aims to evaluate in terms of probabilities the extent to which some
attributes of dependability, such as reliability, availability and safety, are satisfied.
Those attributes are then viewed as measures of dependability.

In this chapter we study common dependability measures, such as failure rate,
mean time to failure, mean time to repair, etc. Examining the time dependence
of failure rate and other measures allows us to gain additional insight into
the nature of failures. Next, we examine possibilities for modeling of system
behaviors using reliability block diagrams and Markov processes. Finally, we
show how to use these models to evaluate a system’s reliability, availability and
safety.

We begin with a brief introduction to probability theory, necessary to
understand the presented material.

2. Basics of probability theory 

Probability is the branch of mathematics which studies the possible outcomes
of given events together with their relative likelihoods and distributions. In 
common language, the word "probability" is used to mean the chance that a 
particular event will occur expressed on a linear scale from 0 (impossibility) to 
1 (certainty). 

The first axiom of probability theory states that the value of probability of 
an event A lies between 0 and 1: 


$$0 \le p(A) \le 1. \qquad (3.1)$$


Let $\bar{A}$ denote the event “not A”. For example, if A stands for “it rains”, $\bar{A}$
stands for “it does not rain”. The second axiom of probability theory says that
the probability of an event $\bar{A}$ is equal to 1 minus the probability of the event A:

$$p(\bar{A}) = 1 - p(A). \qquad (3.2)$$


Suppose that one event, A, is dependent on another event, B. Then $p(A|B)$
denotes the conditional probability of event A, given event B. The multiplication
rule of probability theory states that the probability $p(A \cdot B)$ that both A and B will
occur is equal to the probability that B occurs times the conditional probability
$p(A|B)$:

$$p(A \cdot B) = p(A|B) \cdot p(B), \text{ if A depends on B.} \qquad (3.3)$$


If $p(B)$ is greater than zero, equation (3.3) can be written as

$$p(A|B) = \frac{p(A \cdot B)}{p(B)}. \qquad (3.4)$$


An important condition that we will often assume is that two events are
mutually independent. For events A and B to be independent, the probability
$p(A)$ must not depend on whether B has already occurred or not, and vice versa.
Thus, $p(A|B) = p(A)$. So, for independent events, the rule (3.3) reduces to

$$p(A \cdot B) = p(A) \cdot p(B), \text{ if A and B are independent events.} \qquad (3.5)$$


This is the definition of independence, that the probability of two events both
occurring is the product of the probabilities of each event occurring. Situations
also arise when the events are mutually exclusive. That is, if A occurs, B cannot,
and vice versa. So, $p(A|B) = 0$ and $p(B|A) = 0$ and equation (3.3) becomes

$$p(A \cdot B) = 0, \text{ if A and B are mutually exclusive events.} \qquad (3.6)$$

This is the definition of mutual exclusiveness, that the probability of two
events both occurring is zero.

Let us now consider the situation when either A, or B, or both events may
occur. The probability $p(A + B)$ is given by

$$p(A + B) = p(A) + p(B) - p(A \cdot B). \qquad (3.7)$$

Combining (3.6) and (3.7), we get

$$p(A + B) = p(A) + p(B), \text{ if A and B are mutually exclusive events.} \qquad (3.8)$$


As an example, consider a system consisting of three identical components
A, B and C, each having a reliability R. Let us compute the probability of
exactly one out of three components failing, assuming that the failures of the
individual components are independent. By rule (3.2), the probability that a
single component fails is $1 - R$. Then, by rule (3.5), the probability that a single
component fails and the other two remain operational is $(1 - R)R^2$. Since the
probabilities of any of the three components failing are the same, the overall
probability of one component failing and the other two not is $3(1 - R)R^2$. The three
probabilities are added by applying rule (3.8), because the events are mutually
exclusive.
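The result of this example can be checked numerically. The short Python sketch
below (function names are chosen here for illustration) evaluates the closed form
$3(1 - R)R^2$ and compares it with a brute-force enumeration of all eight
operational/failed combinations of the three components, multiplying
probabilities by rule (3.5) and adding the mutually exclusive cases by rule (3.8).

```python
from itertools import product


def prob_exactly_one_fails(r):
    """Closed form from the text: 3 * (1 - R) * R^2."""
    return 3 * (1 - r) * r ** 2


def prob_exactly_k_fail_by_enumeration(r, n=3, k=1):
    """Sum the probabilities of all component states with exactly k failures."""
    total = 0.0
    for states in product((True, False), repeat=n):   # True means operational
        if states.count(False) != k:
            continue
        p = 1.0
        for operational in states:
            p *= r if operational else (1 - r)        # independence, rule (3.5)
        total += p                                    # exclusive cases, rule (3.8)
    return total


if __name__ == "__main__":
    r = 0.9   # assumed component reliability, for illustration only
    print(prob_exactly_one_fails(r))
    print(prob_exactly_k_fail_by_enumeration(r))
```

For R = 0.9 both functions give approximately 0.243.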

3. Common measures of dependability 

In this section, we describe common dependability measures: failure rate, 
mean time to failure, mean time to repair, mean time between failures and fault 
coverage.

3.1 Failure rate

Failure rate $\lambda$ is the expected number of failures per unit time. For example,
if a processor fails, on average, once every 1000 hours, then it has a failure rate
$\lambda = 1/1000$ failures/hour.

Often failure rate data is available at component level, but not for the entire 
system. This is because several professional organizations collect and publish 
failure rate estimates for frequently used components (diodes, switches, gates, 
flip-flops, etc.). At the same time the design of a new system may involve new 
configurations of such standard components. When component failure rates are 
available, a crude estimation of the failure rate of a non-redundant system can 
be done by adding the failure rates $\lambda_i$ of the $n$ components:

$$\lambda = \sum_{i=1}^{n} \lambda_i.$$
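A minimal Python sketch of this crude estimate is given below. The component
names and failure rates are invented for illustration and are not taken from any
handbook.

```python
def system_failure_rate(component_rates):
    """Crude failure rate of a non-redundant system: the sum of component rates."""
    return sum(component_rates)


if __name__ == "__main__":
    # Hypothetical component failure rates, in failures per hour.
    rates = {"processor": 2e-5, "memory": 1e-5, "bus": 5e-6}
    total = system_failure_rate(rates.values())
    print(f"system failure rate: {total:.2e} failures/hour")
```

The estimate is pessimistic for any system with redundancy, since there a single
component failure does not necessarily cause a system failure.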

Failure rate changes as a function of time. For hardware, a typical evolution 
of failure rate over a system’s life-time is characterized by the phases of infant 
mortality (I), useful life (II) and wear-out (III). These phases are illustrated by
the bathtub curve shown in Figure 3.1. Failure rate at first decreases due
to frequent failures in weak components with manufacturing defects overlooked 
during manufacturer’s testing (poor soldering, leaking capacitor, etc.), then 
stabilizes after a certain time and then increases as electronic or mechanical 
components of the system physically wear out. 



Figure 3.1. Typical evolution of failure rate over a life-time of a hardware system. 


During the useful life phase of the system, the failure rate function is assumed to
have a constant value $\lambda$. Then, the reliability of the system varies exponentially
as a function of time:

$$R(t) = e^{-\lambda t}. \qquad (3.9)$$

This law is known as the exponential failure law. The plot of reliability as a
function of time is shown in Figure 3.2.
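The exponential failure law (3.9) is straightforward to evaluate. The Python
sketch below uses function names chosen here for illustration; it computes the
reliability at a given time and, by inverting (3.9), the time at which the
reliability drops to a chosen target value.

```python
import math


def reliability(failure_rate, t):
    """Exponential failure law (3.9): R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)


def time_to_reliability(failure_rate, target):
    """Time at which R(t) falls to target, obtained by solving (3.9) for t."""
    return -math.log(target) / failure_rate


if __name__ == "__main__":
    lam = 1 / 1000                       # one failure per 1000 hours, as in the text
    for hours in (0, 100, 1000, 5000):
        print(f"R({hours}) = {reliability(lam, hours):.4f}")
    print(f"R(t) reaches 0.5 at t = {time_to_reliability(lam, 0.5):.1f} hours")
```

With a failure rate of 1/1000 failures/hour, the reliability after 1000 hours is
$e^{-1}$, roughly 0.37, so a system is more likely than not to have failed by
the time its operating time equals $1/\lambda$.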



The exponential failure law is very valuable for the analysis of reliability of
components and systems in hardware. However, it can only be used in cases
when the assumption that the failure rate is constant is adequate. Software
failure rate usually decreases as a function of time. A possible curve is shown
in Figure 3.3. The three phases of evolution are: test/debug (I), useful life (II)
and obsolescence (III).

Software failure rate during useful life depends on the following factors:

1 software process used to develop the design and code,

2 complexity of software,

3 size of software,

4 experience of the development team,

5 percentage of code reused from a previous stable project,

6 rigor and depth of testing at test/debug (I) phase.

There are two major differences between hardware and software curves.
One difference is that, in the useful-life phase, software normally experiences
an increase in failure rate each time a feature upgrade is made. Since the
functionality is enhanced by an upgrade, the complexity of software is likely to
be increased, increasing the probability of faults. After the increase in failure
rate due to an upgrade, the failure rate levels off gradually, partly because of
the bugs found and fixed after the upgrades. The second difference is that, in
the last phase, software does not have an increasing failure rate as hardware
does. In this phase, the software is approaching obsolescence and there is no
motivation for more upgrades or changes.



Figure 3.3. Typical evolution of failure rate over a life-time of a software system. 


3.2 Mean time to failure 

Another important and frequently used measure of interest is mean time to 
failure defined as follows. 

The mean time to failure (MTTF) of a system is the expected time of the
occurrence of the first system failure.

If n identical systems are placed into operation at time t = 0 and the time $t_i$,
$i \in \{1, 2, \ldots, n\}$, that each system i operates before failing is measured, then the
average time is MTTF:

$$MTTF = \frac{1}{n} \sum_{i=1}^{n} t_i. \qquad (3.10)$$

In terms of system reliability $R(t)$, MTTF is defined as

$$MTTF = \int_{0}^{\infty} R(t)\,dt. \qquad (3.11)$$


So, MTTF is the area under the reliability curve in Figure 3.2. If the reliability
function obeys the exponential failure law (3.9), then the solution of (3.11) is
given by

$$MTTF = \frac{1}{\lambda}, \qquad (3.12)$$

where $\lambda$ is the failure rate of the system. The smaller the failure rate is, the
longer is the time to the first failure.
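The relationship between (3.11) and (3.12) can be checked numerically. The Python
sketch below approximates the area under the reliability curve with the
trapezoidal rule and compares it with $1/\lambda$. The function name, the
integration horizon and the step count are choices made here for illustration;
the horizon of 50 mean lifetimes is an assumption that makes the truncated tail
negligible for an exponential curve.

```python
import math


def mttf_numeric(reliability, horizon, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) from 0 to horizon."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt


if __name__ == "__main__":
    lam = 1 / 1000                                   # failures per hour
    approx = mttf_numeric(lambda t: math.exp(-lam * t), horizon=50 / lam)
    print(f"numerical MTTF: {approx:.3f} hours, 1/lambda: {1 / lam:.3f} hours")
```

The same routine can be applied to any reliability function, including ones for
which no closed-form MTTF is available, provided the horizon is long enough for
$R(t)$ to have decayed to nearly zero.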

In general, MTTF is meaningful only for systems that operate without repair
until they experience a system failure. In a real situation, most of the mission
critical systems undergo a complete check-out before the next mission is
undertaken. All failed redundant components are replaced and the system is returned
to a fully operational status. When evaluating the reliability of such systems,
mission time rather than MTTF is used.

3.3 Mean time to repair 

The mean time to repair (MTTR) of a system is the average time required to 
repair the system. 

MTTR is commonly specified in terms of the repair rate $\mu$, which is the
expected number of repairs per unit time:

$$MTTR = \frac{1}{\mu}. \qquad (3.13)$$


MTTR depends on the fault recovery mechanism used in the system, location
of the system, location of spare modules (on-site versus off-site), maintenance
schedule, etc. A low MTTR requirement means a high operational cost of the
system. For example, if repair is done by replacing the hardware module, the
hardware spares are kept on-site and the site is maintained 24 hours a day, then
the expected MTTR can be 30 min. However, if the site maintenance is relaxed
to regular working hours on week days only, the expected MTTR increases to
3 days. If the system is remotely located and the operator needs to be flown in
to replace the faulty module, the MTTR can be 2 weeks. In software, if the
failure is detected by watchdog timers and the processor automatically restarts
the failed tasks, without operating system reboot, then MTTR can be 30 sec. If
software fault detection is not supported and a manual reboot by an operator is
required, then the MTTR can range from 30 min to 2 weeks, depending on the
location of the system.

If the system experiences n failures during its lifetime, the total time that the
system is operational is $n \cdot MTTF$. Likewise, the total time the system is being
repaired is $n \cdot MTTR$. The steady-state availability given by the expression (2.2)
can be approximated as

$$A(\infty) = \frac{n \cdot MTTF}{n \cdot MTTF + n \cdot MTTR} = \frac{MTTF}{MTTF + MTTR}. \qquad (3.14)$$
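To see how strongly the repair time affects availability, the following Python
sketch applies (3.14) to the three hardware MTTR values quoted above (30 min,
3 days and 2 weeks). The MTTF of 1000 hours and the function names are
assumptions made here for illustration, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 24 * 365


def steady_state_availability(mttf, mttr):
    """Approximation (3.14): A = MTTF / (MTTF + MTTR), both in the same unit."""
    return mttf / (mttf + mttr)


def downtime_hours_per_year(availability):
    """Expected downtime per year implied by a steady-state availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    mttf = 1000.0                                   # hours, assumed for illustration
    scenarios = {"on-site, 24h staffed": 0.5,       # MTTR in hours
                 "working hours only": 72.0,
                 "remote site": 336.0}
    for name, mttr in scenarios.items():
        a = steady_state_availability(mttf, mttr)
        print(f"{name}: A = {a:.5f}, "
              f"downtime = {downtime_hours_per_year(a):.1f} h/year")
```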


In Section 5.2.2, we will see an alternative approach for computing availability,
which uses Markov processes.

3.4 Mean time between failures

The mean time between failures (MTBF) of a system is the average time
between failures of the system.

If we assume that a repair makes the system perfect, then the relationship
between MTBF and MTTF is as follows:

$$MTBF = MTTF + MTTR. \qquad (3.15)$$


3.5 Fault coverage 

There are several types of fault coverage, depending on whether we are
concerned with fault detection, fault location, fault containment or fault recovery.
Intuitively, fault coverage is the probability that the system will not fail to
perform the expected actions when a fault occurs. More precisely, fault coverage
is defined in terms of the conditional probability $P(A|B)$, read as “probability
of A given B”.

Fault detection coverage is the conditional probability that, given the
existence of a fault, the system detects it.

$$C = P(\text{fault detection} \mid \text{fault existence})$$

For example, a system requirement can be that 99% of all single stuck-at
faults are detected. The fault detection coverage is a measure of the system’s
ability to meet such a requirement.
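One way to estimate fault detection coverage for single stuck-at faults is to
inject each fault into a model of the circuit and check whether a given set of
test vectors exposes it. The Python sketch below does this for a small
hypothetical circuit, f = (a AND b) OR c. It assumes that all single stuck-at
faults are equally likely, so the conditional probability is estimated as the
fraction of detected faults; the circuit, the line names and the test sets are
invented for illustration.

```python
from itertools import product

LINES = ("a", "b", "c", "d", "f")   # inputs a, b, c; internal line d; output f


def evaluate(a, b, c, fault=None):
    """Evaluate f = (a AND b) OR c, optionally with one line stuck at 0 or 1.

    fault is None or a pair (line_name, stuck_value).
    """
    def line(name, value):
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    a, b, c = line("a", a), line("b", b), line("c", c)
    d = line("d", a & b)
    return line("f", d | c)


def detection_coverage(tests):
    """Fraction of single stuck-at faults whose effect reaches the output
    for at least one test vector (a, b, c)."""
    faults = list(product(LINES, (0, 1)))
    detected = [
        fault for fault in faults
        if any(evaluate(*t, fault=fault) != evaluate(*t) for t in tests)
    ]
    return len(detected) / len(faults)


if __name__ == "__main__":
    print(detection_coverage([(1, 1, 0)]))
    print(detection_coverage([(1, 1, 0), (0, 1, 0)]))
    print(detection_coverage([(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]))
```

For this circuit a single test vector detects 4 of the 10 faults, while the four
vectors in the last call detect all of them.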

Fault location coverage is the conditional probability that, given the existence
of a fault, the system locates it.

$$C = P(\text{fault location} \mid \text{fault existence})$$

It is common to require systems to locate faults within easily replaceable
modules. In this case, the fault location coverage can be used as a measure of
success.

Similarly, fault containment coverage is the conditional probability that,
given the existence of a fault, the system contains it.

$$C = P(\text{fault containment} \mid \text{fault existence})$$

Finally, fault recovery coverage is the conditional probability that, given the
existence of a fault, the system recovers.

$$C = P(\text{fault recovery} \mid \text{fault existence})$$

4. Dependability model types

In this section we consider two common dependability models: reliability 
block diagrams and Markov processes. Reliability block diagrams belong to a 
class of combinatorial models, which assume that the failures of the individual 
components are mutually independent. Markov processes belong to a class 
of stochastic processes which take the dependencies between the component 
failures into account, making the analysis of more complex scenarios possible. 

4.1 Reliability block diagrams 

Combinatorial reliability models include reliability block diagrams, fault 
trees, success trees and reliability graphs. In this section we will consider the 
oldest and most common reliability model: reliability block diagrams. 

A reliability block diagram presents an abstract view of the system. The 
components are represented as blocks. The interconnections among the blocks 
show the operational dependency between the components. Blocks are con¬ 
nected in series if all of them are necessary for the system to be operational. 
Blocks are connected in parallel if only one of them is sufficient for the system 
to operate correctly. A diagram for a two-component serial system is shown 
in Figure 3.4(a). Figure 3.4(b) shows a diagram of a two-component parallel 
system. Models of more complex systems may be built by combining the serial 
and parallel reliability models. 



Figure 3.4. Reliability block diagram of a two-component system: (a) serial, (b) parallel. 


As an example, consider a system consisting of two duplicated processors 
and a memory. The reliability block diagram for this system is shown in Figure 
3.5. The processors are connected in parallel, since only one of them is sufficient 
for the system to be operational. The memory is connected in series, since its 
failure would cause the system failure. 



Figure 3.5. Reliability block diagram of a three-component system. 
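Under the independence assumption stated above for combinatorial models, rules
(3.2) and (3.5) are enough to evaluate such a diagram. The Python sketch below is
a hedged illustration: the helper names are chosen here, and the component
reliabilities are invented numbers rather than data from the text.

```python
def series(*reliabilities):
    """All blocks are needed: multiply the reliabilities, rule (3.5)."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """One block suffices: the system fails only if every block fails,
    using rule (3.2) for each block and rule (3.5) for the joint failure."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1 - r)
    return 1 - all_fail


if __name__ == "__main__":
    r_processor, r_memory = 0.9, 0.99        # assumed values, for illustration
    system = series(parallel(r_processor, r_processor), r_memory)
    print(f"system reliability: {system:.4f}")
```

With these assumed values the duplicated processors together reach 0.99, and the
series connection with the memory brings the system to about 0.98, below the
reliability of either stage on its own.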




Reliability block diagrams are a popular model, because they are easy to 
understand and to use for modeling systems with redundancy. In the next 
section we will see that they are also easy to evaluate using analytical methods. 
However, reliability block diagrams, as well as other combinatorial reliability 
models, have a number of serious limitations. 

First, reliability block diagrams assume that the system components are
limited to the operational and failed states and that the system configuration
does not change during the mission. Hence, they cannot model standby
components, repair, or complex fault detection and recovery mechanisms. Second,
the failures of the individual components are assumed to be independent.
Therefore, cases in which the sequence of component failures affects system
reliability cannot be adequately represented.

4.2 Markov processes 

In contrast to combinatorial models, Markov processes take into account the
interactions of component failures, making the analysis of complex scenarios
possible. The theory of Markov processes derives its name from the Russian
mathematician A. A. Markov (1856-1922), who pioneered the systematic
mathematical description of random processes.

Markov processes are a special class of stochastic processes. The basic 
assumption is that the behavior of the system in each state is memoryless. 
The transition from the current state of the system is determined only by the 
present state and not by the previous state or the time at which it reached the 
present state. Before a transition occurs, the time spent in each state follows 
an exponential distribution. In dependability engineering, this assumption is 
satisfied if all events (failures, repairs, etc.) in each state occur with constant 
occurrence rates. 
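
The memoryless property can be checked directly for an exponentially distributed
holding time. The following short derivation is a sketch under one assumption:
the state is left at a constant total rate $\lambda$, so that the time $T$
spent in it satisfies $P(T > t) = e^{-\lambda t}$. The symbols $T$ and $s$ are
introduced here only for illustration.

```latex
P(T > s + t \mid T > s)
  = \frac{P(T > s + t)}{P(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t).
```

The remaining time in the state therefore does not depend on the time $s$
already spent there, which is what the constant-rate assumption requires.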

Markov processes are classified based on state space and time space
characteristics, as shown in Table 3.1.


State Space    Time Space    Common Model Name
-----------    ----------    -----------------
Discrete       Discrete      Discrete Time Markov Chains
Discrete       Continuous    Continuous Time Markov Chains
Continuous     Discrete      Continuous State, Discrete Time Markov Processes
Continuous     Continuous    Continuous State, Continuous Time Markov Processes

Table 3.1. Four types of Markov processes.


In most dependability analysis applications, the state space is discrete. For
example, a system might have two states: operational and failed. The time
scale is usually continuous, which means that component failure and repair
times are random variables. Thus, Continuous Time Markov Chains are the most
commonly used. In some textbooks, they are called Continuous Markov Models.
There are, however, applications in which the time scale is discrete. Examples
include synchronous communication protocols, shifts in equipment operation,
etc. If both time and state space are discrete, then the process is called a
Discrete Time Markov Chain.

Markov processes are illustrated graphically by state transition diagrams.
A state transition diagram is a directed graph $G = (V, E)$, where $V$ is the
set of vertices representing system states and $E$ is the set of edges
representing system transitions. A state transition diagram is a mathematical
model which can be used to represent a wide variety of processes, e.g.
radioactive decay or chemical reactions. For dependability models, a state is
defined to be a particular combination of operating and failed components. For
example, if we have a system consisting of two components, then there are four
different combinations, enumerated in Table 3.2, where O indicates an
operational component and F indicates a failed component.


Component 1    Component 2    State Number
-----------    -----------    ------------
O              O              1
O              F              2
F              O              3
F              F              4

Table 3.2. Markov states of a two-component system.
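
The enumeration of states generalizes to any number of components. The sketch
below lists all combinations of operational and failed components and marks
which of them are operational for a purely serial or purely parallel
configuration. The function names `enumerate_states` and `is_operational` are
hypothetical helpers chosen for this illustration.

```python
from itertools import product


def enumerate_states(n):
    """Return all 2**n combinations of operational ('O') and failed ('F') components."""
    return list(product("OF", repeat=n))


def is_operational(state, configuration):
    """Classify a state for a purely serial or purely parallel system."""
    if configuration == "serial":
        return all(c == "O" for c in state)   # every component is needed
    if configuration == "parallel":
        return any(c == "O" for c in state)   # one working component suffices
    raise ValueError("configuration must be 'serial' or 'parallel'")


states = enumerate_states(2)
for number, state in enumerate(states, start=1):
    print(number, state,
          is_operational(state, "serial"),
          is_operational(state, "parallel"))
```

For two components the states come out in the same order as in Table 3.2, so
the printed state numbers can be compared with the table directly.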


The state transitions reflect the changes which occur within the system state.
For example, if a system with two identical components is in state (11), with
both components operational, and the first module fails, then the system moves
to state (01). So, a Markov process represents possible chains of events which
occur within a system. In the case of dependability analysis, these events are
failures and repairs.

Each edge carries a label reflecting the rate at which the state transition
occurs. Depending on the modeling goals, this can be a failure rate, a repair
rate or both.

We illustrate the concept first on a simple system consisting of a single
component.

4.2.1 Single-component system

A single component has only two states: one operational (state 1) and one
failed (state 2). If no repair is allowed, there is a single, non-reversible
transition between the states, with a label $\lambda$ corresponding to the
failure rate of the component (Figure 3.6).



Figure 3.6. State transition diagram of a single-component system. 


If repair is allowed, then a transition from the failed state to the
operational state is possible, with a repair rate $\mu$ (Figure 3.7). State
diagrams incorporating repair are used in availability analysis.





Figure 3.7. State transition diagram of a single-component system incorporating repair. 


Next, suppose that we would like to distinguish between failed-safe and
failed-unsafe states, as required in safety analysis. Let state 2 be a
failed-safe state and state 3 be a failed-unsafe state (Figure 3.8). The
transition between state 1 and state 2 depends on both the component failure
rate $\lambda$ and the probability that, if a fault exists, the system succeeds
in detecting it and in taking the corresponding actions to fail in a safe
manner, i.e. on the fault coverage $C$. The transition between state 1 and the
failed-unsafe state 3 depends on the failure rate $\lambda$ and the probability
that a fault is not detected, i.e. $1 - C$.



Figure 3.8. State transition diagram of a single-component system for safety analysis. 


4.2.2 Two-component system 

A two-component system has four possible states, enumerated in Table 3.2.
The changes of states are illustrated by the state transition diagram shown in
Figure 3.9. The failure rates $\lambda_1$ and $\lambda_2$ of components 1 and 2
indicate the rates at which the transitions are made between the states. The
two components are assumed to be independent and non-repairable.



Figure 3.9. State transition diagram for a system with two independent components.


If the components are in a serial configuration, then any component failure
causes system failure. So, only state 1 is an operational state. States 2, 3
and 4 are failed states. If the components are in parallel, both components
must fail to have a system failure. Therefore, states 1, 2 and 3 are the
operational states, whereas state 4 is a failed state.

4.2.3 State transition diagram simplification

It is often possible to reduce the size of a state transition diagram without
a sacrifice in accuracy. For example, suppose the components in the
two-component system shown in Figure 3.9 are in parallel. If the components
have identical failure rates $\lambda_1 = \lambda_2 = \lambda$, then it is not
necessary to distinguish between states 2 and 3. Both states represent a
condition where one component is operational and one is failed. So, we can
merge these two states into one (Figure 3.10). The assignments of the state
numbers in the simplified transition diagram are shown in Table 3.3.


Component 1    Component 2    State Number
-----------    -----------    ------------
O              O              1
O              F              2
F              O              2
F              F              3

Table 3.3. Markov states of a simplified state transition diagram of a
two-component parallel system.


Since the failures of components are assumed to be independent events, the
transition rate from state 1 to state 2 in Figure 3.10 is the sum of the
transition rates from state 1 to states 2 and 3 in Figure 3.9, i.e.
$2\lambda$.



Figure 3.10. Simplified state transition diagram of a two-component parallel system. 
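
The rate of the merged transition can also be obtained directly. The short
derivation below is a sketch assuming that both components have independent,
exponentially distributed lifetimes $T_1$ and $T_2$ with the same rate
$\lambda$ (the symbols $T_1$ and $T_2$ are introduced only for this
illustration). The system leaves state 1 at the first failure, i.e. at time
$\min(T_1, T_2)$.

```latex
P(\min(T_1, T_2) > t)
  = P(T_1 > t)\, P(T_2 > t)
  = e^{-\lambda t} \cdot e^{-\lambda t}
  = e^{-2\lambda t}.
```

So the time spent in state 1 is again exponentially distributed, with a rate
equal to twice the component failure rate.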


5. Dependability computation methods 

In this section we study how reliability block diagrams and Markov processes
can be used to evaluate system dependability.

5.1 Computation using reliability block diagrams

Reliability block diagrams can be used to compute system reliability as well
as system availability.

5.1.1 Reliability computation

To compute the reliability of a system represented by a reliability block
diagram, we first need to break the system down into its serial and parallel
parts. Next, the reliabilities of these parts are computed. Finally, the
overall solution is composed from the reliabilities of the parts.

Given a system consisting of $n$ components with $R_i(t)$ being the reliability
of the $i$th component, the reliability of the overall system is given by

$$R(t) = \begin{cases}
  \prod_{i=1}^{n} R_i(t) & \text{for a series structure,} \\
  1 - \prod_{i=1}^{n} (1 - R_i(t)) & \text{for a parallel structure.}
\end{cases} \qquad (3.16)$$

In a serial system, all components should be operational for a system to
function correctly. Hence, by rule (3.5),
$R_{serial}(t) = \prod_{i=1}^{n} R_i(t)$. In a parallel system, only one of the
components is required for a system to be operational. So, the unreliability
of a parallel system is equal to the probability that all $n$ elements fail,
i.e. $Q_{parallel}(t) = \prod_{i=1}^{n} Q_i(t) = \prod_{i=1}^{n} (1 - R_i(t))$.
Hence, by rule 1,
$R_{parallel}(t) = 1 - Q_{parallel}(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))$.

Designing a reliable serial system is difficult. For example, if a serial
system with 100 components is to be built, and each of the components has a
reliability of 0.999, the overall system reliability is
$0.999^{100} \approx 0.905$.

On the other hand, a parallel system can be made reliable despite the
unreliability of its component parts. For example, a parallel system of four
identical modules with the module reliability 0.95 has the system reliability
$1 - (1 - 0.95)^4 = 0.99999375$. Clearly, however, the cost of the parallelism
can be high.
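
The two numerical examples above can be reproduced with a few lines of code.
This is a minimal sketch of the series and parallel rules of (3.16) for
independent components; the function names are hypothetical, and the
reliabilities 0.9 and 0.99 used for the processor and memory example of
Figure 3.5 are made-up illustrative values.

```python
from math import prod


def series_reliability(reliabilities):
    """All components are required: product of the component reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """One component suffices: 1 minus the product of the unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# 100 components in series, each with reliability 0.999
print(round(series_reliability([0.999] * 100), 3))

# four identical modules in parallel, each with reliability 0.95
print(round(parallel_reliability([0.95] * 4), 8))

# two processors in parallel, in series with a memory (structure of Figure 3.5);
# 0.9 and 0.99 are illustrative values, not taken from the text
r_processor, r_memory = 0.9, 0.99
print(series_reliability([parallel_reliability([r_processor] * 2), r_memory]))
```

The last call shows how a mixed structure is evaluated: the parallel part is
reduced to a single equivalent block first, and the result is then combined in
series with the remaining block.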

5.1.2 Availability computation 

If we assume that the failure and repair times are independent, then we
can use reliability block diagrams to compute the system availability. This
situation occurs when the system has enough spare resources to repair all the
failed components simultaneously. Given a system consisting of $n$ components
with $A_i(t)$ being the availability of the $i$th component, the availability
of the overall system is given by

$$A(t) = \begin{cases}
  \prod_{i=1}^{n} A_i(t) & \text{for a series structure,} \\
  1 - \prod_{i=1}^{n} (1 - A_i(t)) & \text{for a parallel structure.}
\end{cases} \qquad (3.17)$$

The combined availability of two components in series is always lower than
the availability of the individual components. For example, if one component
has the availability 99% (3.65 days/year downtime) and another component has
the availability 99.99% (52 minutes/year downtime), then the availability of
the system consisting of these two components in series is 98.99% (3.69
days/year downtime). In contrast, a parallel system consisting of three
identical components with individual availability 99% has an availability of
99.9999% (31 seconds/year downtime).
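
The downtime figures quoted above follow from (3.17) and the length of a year.
The sketch below recomputes them, assuming a 365-day year; the function names
and the constant `SECONDS_PER_YEAR` are hypothetical and chosen for
illustration.

```python
from math import prod

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def series_availability(availabilities):
    """All components must be available."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """At least one component must be available."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_seconds_per_year(availability):
    """Expected yearly downtime for a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


a_series = series_availability([0.99, 0.9999])
a_parallel = parallel_availability([0.99] * 3)
print(a_series, downtime_seconds_per_year(a_series) / 86400, "days/year")
print(a_parallel, downtime_seconds_per_year(a_parallel), "seconds/year")
```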


5.2 Computation using Markov processes 

In this section we show how Markov processes are used to evaluate system
dependability. Continuous Time Markov Chains are the most important class
of Markov processes for dependability analysis, so the presentation is focused
on this model.

The aim of Markov process analysis is to calculate $P_i(t)$, the probability
that the system is in state $i$ at time $t$. Once this is known, the system
reliability, availability or safety can be computed as a sum taken over all
the operating states.

Let us designate state 1 as the state in which all the components are
operational. Assuming that at $t = 0$ the system is in state 1, we get

$$P_1(0) = 1.$$

Since at any time the system can be only in one state, $P_i(0) = 0$,
$\forall i \neq 1$, and we have

$$\sum_{i \in O \cup F} P_i(t) = 1, \qquad (3.18)$$

where the sum is over all possible states.

To determine the $P_i(t)$, we derive a set of differential equations, one for
each state of the system. These equations are called state transition equations
because they allow the $P_i(t)$ to be determined in terms of the rates (failure,
repair) at which transitions are made from one state to another. State
transition equations are usually presented in matrix form. The matrix $M$ whose
entry $m_{ij}$ is the rate of transition between the states $i$ and $j$ is
called the transition matrix associated with the system. We use the first index
$i$ for the columns of the matrix and the second index $j$ for the rows, i.e.
$M$ has the following structure

$$M = \begin{bmatrix}
  m_{11} & m_{21} & \cdots & m_{k1} \\
  m_{12} & m_{22} & \cdots & m_{k2} \\
  \vdots & \vdots &        & \vdots \\
  m_{1k} & m_{2k} & \cdots & m_{kk}
\end{bmatrix}$$

where $k$ is the number of states in the state transition diagram representing
the system. In reliability or availability analysis the components of the
system are normally assumed to be in either operational or failed states. So,
if a system consists of $n$ components, then $k \leq 2^n$. In safety analysis,
where the system can fail in either a safe or an unsafe way, $k$ can be up to
$3^n$. The entries in each column of the transition matrix must sum up to 0.
So, the entries $m_{ii}$ corresponding to self-transitions are computed as
$-\sum m_{ij}$ for all $j \in \{1, 2, \ldots, k\}$ such that $j \neq i$.

For example, the transition matrix for the state transition diagram of a
single-component system shown in Figure 3.6 is

$$M = \begin{bmatrix} -\lambda & 0 \\ \lambda & 0 \end{bmatrix}. \qquad (3.19)$$

The rate of transition between the states 1 and 2 is $\lambda$, therefore
$m_{12} = \lambda$. Hence, $m_{11} = -\lambda$. The rate of transition between
the states 2 and 1 is 0, so $m_{21} = 0$ and thus $m_{22} = 0$.

Similarly, the transition matrix for the state transition diagram in Figure 3.7,
which incorporates repair, is

$$M = \begin{bmatrix} -\lambda & \mu \\ \lambda & -\mu \end{bmatrix}. \qquad (3.20)$$

The transition matrix for the state transition diagram in Figure 3.8 is of size
3 x 3, since, for safety analysis, the system is modeled to be in three
different states: operational, failed-safe and failed-unsafe.

$$M = \begin{bmatrix}
  -\lambda       & 0 & 0 \\
  \lambda C      & 0 & 0 \\
  \lambda (1 - C) & 0 & 0
\end{bmatrix}. \qquad (3.21)$$

The transition matrix for the simplified state transition diagram of the
two-component system shown in Figure 3.10 is

$$M = \begin{bmatrix}
  -2\lambda & 0        & 0 \\
  2\lambda  & -\lambda & 0 \\
  0         & \lambda  & 0
\end{bmatrix}. \qquad (3.22)$$

The examples above illustrate two important properties of transition matrices.
The first, which we have mentioned before, is that the sum of the entries in
each column is zero. A positive sign of the $ij$th entry indicates that the
transition originates in the $i$th state. A negative sign of the $ij$th entry
indicates that the transition terminates in the $i$th state.

The second property of the transition matrix is that it allows us to
distinguish between the operational and failed states. In reliability
analysis, once a system fails, the failed state cannot be left. Therefore,
each failed state $i$ has a zero diagonal element $m_{ii}$. This is not the
case, however, when availability or safety are computed, as one can see from
(3.20) and (3.21).
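
The construction of a transition matrix from a state transition diagram can be
sketched in code. The helper below is hypothetical and only illustrates the
convention used in this section: the rate of the transition from state $i$ to
state $j$ is stored in column $i$, row $j$, and each diagonal entry is then
chosen so that its column sums to zero. The numerical failure rate is an
arbitrary illustrative value.

```python
def transition_matrix(k, rates):
    """Build a k x k transition matrix from {(i, j): rate}, states numbered from 1.

    The rate of the transition from state i to state j goes into
    column i, row j. Diagonal entries are set last so that every
    column sums to zero.
    """
    m = [[0.0] * k for _ in range(k)]
    for (i, j), rate in rates.items():
        m[j - 1][i - 1] = rate                     # row j, column i
    for i in range(k):
        m[i][i] = -sum(m[j][i] for j in range(k) if j != i)
    return m


lam = 0.001  # illustrative failure rate, not taken from the text

# simplified two-component parallel system (Figure 3.10):
# state 1 -> state 2 with rate 2*lam, state 2 -> state 3 with rate lam
m = transition_matrix(3, {(1, 2): 2 * lam, (2, 3): lam})
for row in m:
    print(row)
```

The failed state 3 has no outgoing transitions, so its diagonal entry stays at
zero, in line with the second property described above.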

Using state transition matrices, state transition equations are derived as
follows. Let $P(t)$ be a vector whose $i$th element is the probability
$P_i(t)$ that the system is in state $i$ at time $t$. Then the matrix
representation of a system of state transition equations is given by

$$\frac{d}{dt} P(t) = M \cdot P(t). \qquad (3.23)$$

Once the system of equations is solved and the $P_i(t)$ are known, the system's
reliability, availability or safety can be computed as a sum taken over all the
operating states.

We illustrate the computation process on a number of simple examples.


5.2.1 Reliability evaluation 
Independent components case 

Let us first compute the reliability of a parallel system consisting of two
independent components which we have considered before (Figure 3.9). Applying
(3.23) to the matrix (3.22) we get

$$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix} =
\begin{bmatrix}
  -2\lambda & 0        & 0 \\
  2\lambda  & -\lambda & 0 \\
  0         & \lambda  & 0
\end{bmatrix} \cdot
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}$$

The above matrix form represents the following system of state transition
equations

$$\begin{cases}
  \frac{d}{dt} P_1(t) = -2\lambda P_1(t) \\
  \frac{d}{dt} P_2(t) = 2\lambda P_1(t) - \lambda P_2(t) \\
  \frac{d}{dt} P_3(t) = \lambda P_2(t)
\end{cases}$$

By solving this system of equations, we get

$$P_1(t) = e^{-2\lambda t}$$
$$P_2(t) = 2e^{-\lambda t} - 2e^{-2\lambda t}$$
$$P_3(t) = 1 - 2e^{-\lambda t} + e^{-2\lambda t}$$

Since the $P_i(t)$ are known, we can now calculate the reliability of the
system. For the parallel configuration, both components should fail to have a
system failure. Therefore, the reliability of the system is the sum of the
probabilities $P_1(t)$ and $P_2(t)$:

$$R_{parallel}(t) = 2e^{-\lambda t} - e^{-2\lambda t} \qquad (3.24)$$

In the general case, the reliability of the system is computed using the
equation

$$R(t) = \sum_{i \in O} P_i(t), \qquad (3.25)$$

where the sum is taken over all the operating states $O$. Alternatively, the
reliability can be calculated as

$$R(t) = 1 - \sum_{i \in F} P_i(t),$$

where the sum is taken over all the states $F$ in which the system has failed.

Note that, for constant failure rates, the component reliability is
$R(t) = e^{-\lambda t}$. Therefore, the equation (3.24) can be written as
$R_{parallel}(t) = 2R - R^2$, which agrees with the expression (3.16) derived
using reliability block diagrams, since $1 - (1 - R)^2 = 2R - R^2$. The two
results are the same because in this example we assumed the component failures
to be mutually independent.
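
The closed-form result can be cross-checked by integrating the state transition
equations numerically. The sketch below uses the classical fourth-order
Runge-Kutta scheme, which is a generic numerical method and not the solution
technique of the text; the function names and the values of the failure rate
and mission time are assumptions made for illustration.

```python
from math import exp


def derivative(m, p):
    """Right-hand side M * P of the state transition equations."""
    n = len(p)
    return [sum(m[r][c] * p[c] for c in range(n)) for r in range(n)]


def integrate(m, p0, t_end, steps=10000):
    """Integrate dP/dt = M * P from 0 to t_end with the classical RK4 scheme."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = derivative(m, p)
        k2 = derivative(m, [x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = derivative(m, [x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = derivative(m, [x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


lam, t = 0.01, 50.0           # illustrative values, not taken from the text
m = [[-2 * lam, 0.0, 0.0],
     [2 * lam, -lam, 0.0],
     [0.0, lam, 0.0]]         # transition matrix (3.22)

p = integrate(m, [1.0, 0.0, 0.0], t)               # system starts in state 1
r_numeric = p[0] + p[1]                            # sum over operational states
r_closed = 2 * exp(-lam * t) - exp(-2 * lam * t)   # equation (3.24)
print(r_numeric, r_closed)
```

The same two functions can be reused for any of the transition matrices in this
section, which is convenient when a closed-form solution is tedious to derive.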

Dependent components case

The value of Markov processes becomes evident in situations in which component
failure rates are no longer assumed to be independent of the system state.
One of the common cases of dependence is load-sharing components, which
we consider next. Another possibility is the case of standby components, which
is considered in the availability computation section.

The word load is used in the broad sense of the stress on a system. This can be
an electrical load, a load caused by high temperature, or an information load.
In practice, failure rates are found to increase with loading. Suppose that two
components share a load. If one of the components fails, the additional load on
the second component is likely to increase its failure rate.

To model load-sharing failures, consider the state transition diagram of a
two-component parallel system shown in Figure 3.11. As before, we have four
states. However, after one component failure, the failure rate of the second
component increases. The increased failure rates of the components 1 and 2
are denoted by $\lambda'_1$ and $\lambda'_2$, respectively.

Figure 3.11. State transition diagram of a two-component parallel system with load sharing.


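From the state transition diagram in Figure 3.11, we can derive the state
transition equations for $P_i(t)$. In the matrix form they are

$$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{bmatrix} =
\begin{bmatrix}
  -\lambda_1 - \lambda_2 & 0           & 0           & 0 \\
  \lambda_1              & -\lambda'_2 & 0           & 0 \\
  \lambda_2              & 0           & -\lambda'_1 & 0 \\
  0                      & \lambda'_2  & \lambda'_1  & 0
\end{bmatrix} \cdot
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{bmatrix}$$

By expanding the matrix form, we get the following system of equations

$$\begin{cases}
  \frac{d}{dt} P_1(t) = (-\lambda_1 - \lambda_2) P_1(t) \\
  \frac{d}{dt} P_2(t) = \lambda_1 P_1(t) - \lambda'_2 P_2(t) \\
  \frac{d}{dt} P_3(t) = \lambda_2 P_1(t) - \lambda'_1 P_3(t) \\
  \frac{d}{dt} P_4(t) = \lambda'_2 P_2(t) + \lambda'_1 P_3(t)
\end{cases}$$

The solution of this system of equations is

$$P_1(t) = e^{-(\lambda_1 + \lambda_2) t}$$
$$P_2(t) = \frac{\lambda_1}{\lambda_1 + \lambda_2 - \lambda'_2}
  \left( e^{-\lambda'_2 t} - e^{-(\lambda_1 + \lambda_2) t} \right)$$
$$P_3(t) = \frac{\lambda_2}{\lambda_1 + \lambda_2 - \lambda'_1}
  \left( e^{-\lambda'_1 t} - e^{-(\lambda_1 + \lambda_2) t} \right)$$
$$P_4(t) = 1 - P_1(t) - P_2(t) - P_3(t)$$

Finally, since both components should fail for the system to fail, the
reliability is equal to $1 - P_4(t)$, yielding the expression

$$R_{parallel}(t) = e^{-(\lambda_1 + \lambda_2) t}
  + \frac{\lambda_1}{\lambda_1 + \lambda_2 - \lambda'_2}
    \left( e^{-\lambda'_2 t} - e^{-(\lambda_1 + \lambda_2) t} \right)
  + \frac{\lambda_2}{\lambda_1 + \lambda_2 - \lambda'_1}
    \left( e^{-\lambda'_1 t} - e^{-(\lambda_1 + \lambda_2) t} \right)
  \qquad (3.26)$$

If $\lambda'_1 = \lambda_1$ and $\lambda'_2 = \lambda_2$, the above equation
reduces to $e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t}$,
which for $\lambda_1 = \lambda_2 = \lambda$ is equal to (3.24). The effect of
the increased loading can be illustrated as follows. Assume that the two
components are identical, i.e. $\lambda_1 = \lambda_2 = \lambda$ and
$\lambda'_1 = \lambda'_2 = \lambda'$. Then, the equation (3.26) reduces to

$$R_{parallel}(t) = \frac{2\lambda}{2\lambda - \lambda'} e^{-\lambda' t}
  - \frac{\lambda'}{2\lambda - \lambda'} e^{-2\lambda t}.$$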



Figure 3.12. Reliability of a two-component parallel system with load sharing. 


Figure 3.12 shows the reliability of a two-component parallel system with
load sharing for different values of $\lambda'$. The reliability of a
single-component system is also plotted for comparison. In the case where
$\lambda' = \lambda$, the two components are independent, so the reliability is
given by (3.24). $\lambda' = \infty$ is the case of total dependency. This
means that the failure of one component would bring the immediate failure of
the other component. So, in this case, the reliability will be equal to the
reliability of a serial system with two components (3.16). It can be seen that
the more the value of $\lambda'$ exceeds the value of $\lambda$, the closer the
reliability of this system approaches the reliability of the serial system with
two components.
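
The trend described above can be reproduced numerically. The sketch below
evaluates the reliability of two identical load-sharing components, assuming
the identical-component form of (3.26),
$R(t) = \frac{2\lambda}{2\lambda - \lambda'} e^{-\lambda' t} - \frac{\lambda'}{2\lambda - \lambda'} e^{-2\lambda t}$.
The function name, the failure rate, the mission time and the load factors are
assumptions chosen for illustration, and the case $\lambda' = 2\lambda$ is
excluded because the expression is then undefined in this form.

```python
from math import exp


def load_sharing_reliability(lam, lam_loaded, t):
    """Reliability of two identical load-sharing components in parallel.

    lam is the failure rate while both components work, lam_loaded the
    increased rate of the surviving component. Requires lam_loaded != 2 * lam.
    """
    d = 2 * lam - lam_loaded
    return (2 * lam / d) * exp(-lam_loaded * t) - (lam_loaded / d) * exp(-2 * lam * t)


lam, t = 0.01, 50.0   # illustrative values, not taken from the text
for factor in (1, 1.5, 5, 100, 10000):
    print(factor, load_sharing_reliability(lam, factor * lam, t))

print("independent:", 2 * exp(-lam * t) - exp(-2 * lam * t))   # equation (3.24)
print("serial:     ", exp(-2 * lam * t))                       # two components in series
```

With a factor of 1 the result coincides with the independent case, and as the
factor grows the value moves towards the reliability of the two-component
serial system.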

5.2.2 Availability evaluation 

In availability analysis, as well as in reliability analysis, there are
situations in which the component failures cannot be considered independent of
one another. These include shared-load systems and systems with standby
components, which are repairable.

The dependencies between component failures can be analyzed using Markov
methods, provided that the failures are detected and that the failure and repair
rates are time-independent. There is a fundamental difference between the
treatment of repair for reliability and availability analysis. In reliability
calculations, components are allowed to be repaired only as long as the system
has not failed. In availability calculations, the components can also be
repaired after the system failure.

The difference is best illustrated on a simple example of a system with two
components, one primary and one standby. The standby component is held in
reserve and only brought to operation when the primary component fails. We
assume that there is a perfect fault detection unit which detects a failure in
the primary component and replaces it by the standby component. We also assume
that the standby component cannot fail while it is in the standby mode.

The state transition diagrams of the standby system for reliability and
availability analysis are shown in Figure 3.13(a) and (b), respectively. The
states are numbered according to Table 3.4.


Component 1    Component 2    State Number
-----------    -----------    ------------
O              O              1
F              O              2
F              F              3

Table 3.4. Markov states of a simplified state transition diagram of a
two-component parallel system incorporating repair.


When the primary component fails, there is a transition between states 1 and 2.
If the system is in state 2 and the backup component fails, there is a
transition to state 3. Since we assumed that the backup unit cannot fail while
in the standby mode, the combination (O,F) cannot occur. States 1 and 2 are
operational states. State 3 is the failed state.


Figure 3.13. State transition diagrams for a standby two-component system: (a) for
reliability analysis, (b) for availability analysis.


Suppose the primary unit can be repaired with a rate $\mu$. For reliability
analysis, this implies that a transition between states 2 and 1 is possible.
The corresponding transition matrix is given by

$$M = \begin{bmatrix}
  -\lambda_1 & \mu              & 0 \\
  \lambda_1  & -\lambda_2 - \mu & 0 \\
  0          & \lambda_2        & 0
\end{bmatrix}.$$

For availability analysis, we should be able to repair the backup unit as well.
This adds a transition between states 3 and 2. If we assume that the repair
rates for the primary and the backup units are the same, and also that the
backup unit is repaired first, then the corresponding transition matrix is
given by

$$M = \begin{bmatrix}
  -\lambda_1 & \mu              & 0 \\
  \lambda_1  & -\lambda_2 - \mu & \mu \\
  0          & \lambda_2        & -\mu
\end{bmatrix}. \qquad (3.27)$$

One can see that, in the matrix for availability calculations, none of the
diagonal elements is zero. This is because the system should be able to recover
from the failed state. By solving the system of state transition equations, we
can get $P_i(t)$ and compute the availability of the system as

$$A(t) = 1 - \sum_{i \in F} P_i(t), \qquad (3.28)$$

where the sum is taken over all the failed states $F$.

Usually, the steady-state availability rather than the time-dependent
availability is of interest. The steady-state availability can be computed in a
simpler way. We note that, as time approaches infinity, the derivative on the
left-hand side of equation (3.23) vanishes and we get a time-independent
relationship

$$M \cdot P(\infty) = 0. \qquad (3.29)$$

In our example, for matrix (3.27) this represents a system of equations

$$\begin{cases}
  -\lambda_1 P_1(\infty) + \mu P_2(\infty) = 0 \\
  \lambda_1 P_1(\infty) - (\lambda_2 + \mu) P_2(\infty) + \mu P_3(\infty) = 0 \\
  \lambda_2 P_2(\infty) - \mu P_3(\infty) = 0
\end{cases}$$

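Since these three equations are linearly dependent, they are not sufficient to
solve for $P(\infty)$. The needed piece of additional information is the
condition (3.18) that the sum of all probabilities is one:

$$\sum_{i=1}^{3} P_i(\infty) = 1. \qquad (3.30)$$

If we assume $\lambda_1 = \lambda_2 = \lambda$, then we get

$$P_1(\infty) = \left[ 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right]^{-1}$$
$$P_2(\infty) = \left[ 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right]^{-1} \frac{\lambda}{\mu}$$
$$P_3(\infty) = \left[ 1 + \frac{\lambda}{\mu} + \left( \frac{\lambda}{\mu} \right)^2 \right]^{-1} \left( \frac{\lambda}{\mu} \right)^2$$

The steady-state availability can be found by setting $t = \infty$ in (3.28):

$$A(\infty) = 1 - \frac{(\lambda/\mu)^2}{1 + \lambda/\mu + (\lambda/\mu)^2}.$$

If we further assume that $\lambda/\mu \ll 1$, we can write

$$A(\infty) \approx 1 - \left( \frac{\lambda}{\mu} \right)^2.$$

To summarize, steady-state availability problems are solved by the same
procedure as time-dependent availability. Any $n - 1$ of the $n$ equations
represented by (3.29) are combined with the condition (3.30) to solve for the
components of $P(\infty)$. These are then substituted into (3.28) to obtain
availability.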
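
The procedure summarized above can be sketched as a small linear solve. The
code below replaces the last equation of $M \cdot P(\infty) = 0$ by the
normalization condition and solves the resulting system by Gaussian
elimination. The helper names `solve` and `steady_state` are hypothetical, and
the rates are arbitrary illustrative values; the matrix is the availability
matrix of the standby system with equal failure rates, written out explicitly.

```python
def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def steady_state(m):
    """Solve M * P = 0 together with sum(P) = 1.

    The last balance equation is replaced by the normalization condition.
    """
    k = len(m)
    a = [row[:] for row in m[:-1]] + [[1.0] * k]
    b = [0.0] * (k - 1) + [1.0]
    return solve(a, b)


lam, mu = 0.01, 1.0           # illustrative rates, not taken from the text
m = [[-lam, mu, 0.0],
     [lam, -lam - mu, mu],
     [0.0, lam, -mu]]         # standby system, both failure rates equal to lam

p = steady_state(m)
availability = 1.0 - p[2]     # state 3 is the only failed state
rho = lam / mu
print(availability, 1.0 - rho**2 / (1.0 + rho + rho**2))
```

Both printed values should agree, which gives a quick check of the closed-form
steady-state expressions.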


5.2.3 Safety evaluation

The main difference between safety calculation and reliability calculation is
in the construction of the state transition diagram. As we mentioned before,
for safety analysis, the failed state is split into failed-safe and
failed-unsafe ones. Once the state transition diagram for a system is derived,
the state transition equations are obtained and solved using the same procedure
as for reliability analysis.

As an example, consider the single-component system shown in Figure 3.8.
Its state transition matrix is given by (3.21). So, the state transition
equations for $P_i(t)$ are given by

$$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix} =
\begin{bmatrix}
  -\lambda        & 0 & 0 \\
  \lambda C       & 0 & 0 \\
  \lambda (1 - C) & 0 & 0
\end{bmatrix} \cdot
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}$$


The solution of this system of equations is

$$P_1(t) = e^{-\lambda t}$$
$$P_2(t) = C - C e^{-\lambda t}$$
$$P_3(t) = (1 - C) - (1 - C) e^{-\lambda t}$$

The safety of the system is the sum of the probabilities of being in the
operational or failed-safe states, i.e.

$$S(t) = P_1(t) + P_2(t) = C + (1 - C) e^{-\lambda t}$$

At time $t = 0$ the safety of the system is 1, as expected. As time approaches
infinity, the safety approaches the fault detection coverage, $S(\infty) = C$.
So, if $C = 1$, the system has perfect safety.
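
The limiting behaviour of the safety can be checked numerically. This is a
minimal sketch of $S(t) = C + (1 - C) e^{-\lambda t}$ for the single-component
model of Figure 3.8; the function name and the values of the failure rate and
coverage are assumptions chosen for illustration.

```python
from math import exp


def safety(lam, coverage, t):
    """S(t) = P1(t) + P2(t) for the single-component safety model."""
    p1 = exp(-lam * t)                        # operational
    p2 = coverage * (1.0 - exp(-lam * t))     # failed-safe
    return p1 + p2


lam, coverage = 0.001, 0.9   # illustrative values, not taken from the text
for t in (0.0, 100.0, 1000.0, 1e6):
    print(t, safety(lam, coverage, t))
```

The printed values start at 1 for $t = 0$ and settle at the coverage for large
$t$.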

6. Problems 

3.1. Why is dependability evaluation important?

3.2. What is the difference between qualitative and quantitative evaluation?

3.3. Define the failure rate. How can the failure rate of a non-redundant system
be computed from the failure rates of its components?

3.4. How does a typical evolution of failure rate over a system's life-time differ
for hardware and software?

3.5. What is the mean time to failure of a system? How can the MTTF of a
non-redundant system be computed from the MTTF of its components?

3.6. What is the difference between the mean time to repair and the mean time
between failures?

3.7. A heart pace-maker has a constant failure rate of $\lambda$ = 8.167/10^ hr.

(a) What is the probability that it will fail during the first year of operation?

(b) What is the probability that it will fail within 5 years of operation?

(c) What is the MTTF?

3.8. A logic circuit with no redundancy consists of 16 two-input NAND gates
and 3 J-K flip-flops. Assuming the constant failure rates of a two-input
NAND gate and a J-K flip-flop are 0.2107 and 0.4667 per hour, respectively,
compute

(a) the failure rate of the logic circuit,

(b) the reliability for a 72-hour mission,

(c) the MTTF.

3.9. An automatic teller machine manufacturer determines that his product has
a constant failure rate of $\lambda$ = 77.16 per 10^ hours in normal use. For how
long should the warranty be set if no more than 5% of the machines are to
be returned to the manufacturer for repair?

3.10. A car manufacturer estimates that the reliability of his product is 99% during
the first 7 years.

(a) How many cars will need a repair during the first year?

(b) What is the MTTF?

3.11. A two-year guarantee is given on a TV-set based on the assumption that no
more than 3% of the items will be returned for repair. Assuming the exponential
failure law, what is the maximum failure rate that can be tolerated?

3.12. A DVD-player manufacturer determines that the average DVD set is used
930 hr/year. A two-year warranty is offered on the DVD set having an MTTF
of 2500 hr. If the exponential failure law holds, which fraction of DVD-sets
will fail during the warranty period?

3.13. Suppose the failure rate of a jet engine is $\lambda$ = 10“^ per hour. What is the
probability that more than two engines on a four-engine aircraft will fail
during a 4-hour flight? Assume that the failures are independent.

3.14. A non-redundant system with 50 components has a design life reliability of
0.95. The system is re-designed so that it has only 35 components. Estimate
the design life reliability of the re-designed system. Assume that all the
components have a constant failure rate of the same value and the failures are
independent.

3.15. At the end of the year of service the reliability of a component with a constant
failure rate is 0.96.

(a) What is the failure rate?

(b) If two components are put in parallel, what is the one year reliability?
Assume that the failures are independent.

3.16. A lamp has three 25V bulbs. The failure rate of a bulb is $\lambda$ = 10“^ per year.
What is the probability that more than one bulb fails during the first month?

3.17. Suppose a component has a failure rate of $\lambda$ = 0.007 per hour. How many
components should be placed in parallel if the system is to run for 200 hours
with a failure probability of no more than 0.02? Assume that the failures are
independent.

3.18. Consider a system with three identical components with failure rate $\lambda$. Find
the system failure rate $\lambda_{sys}$ for the following cases:

(a) All three components in series.

(b) All three components in parallel.

(c) Two components in parallel and the third in series.

3.19. The MTTF of a system with constant failure rate has been determined to 
be 1000 hr. An engineer is to set the design life time so that the end-of-life 
reliability is 0.95. 

(a) Determine the design life time. 

(b) If two systems are placed in parallel, to what value may the design life 
time be increased without decreasing the end-of-life reliability? 

3.20. A printer has an MTTF = 168 hr and MTTR = 3 hr. 

(a) What is the steady state availability? 

(b) If MTTR is reduced to 1 hr, what MTTF can be tolerated without
decreasing the availability of the printer?

3.21. A copy machine has a failure rate of 0.01 per week. What repair rate should
be maintained to achieve a steady state availability of 95%?

3.22. Suppose that the steady state availability for a standby system should be 0.9.
What is the maximum acceptable value of the failure to repair ratio $\lambda/\mu$?

3.23. A computer system is designed to have a failure rate of one fault in 5 years. 
The rate remains constant over the life of the system. The system has no 
fault-tolerance capabilities, so it fails upon occurrence of the first fault. For 
such a system: 

(a) What is the MTTF? 

(b) What is the probability that the system will fail in its first year of
operation?

(c) The vendor of this system wishes to offer insurance against failure for 
the three years of operation of the system at some extra cost. The vendor 
determined that it should charge $200 for each 10 % drop in reliability 
to offer such an insurance. How much should the vendor charge to the 
client for such an insurance? 

3.24. What are the basic assumptions regarding the failures of the individual
components (a) in reliability block diagram model; (b) in Markov process model?

3.25. A system consists of three modules: $M_1$, $M_2$ and $M_3$. It was analyzed, and
the following reliability expression was derived:

$$R_{system} = R_1 R_3 + R_2 R_3 - R_1 R_2 R_3$$

Draw the reliability block diagram for this system.

3.26. A system with four modules: A, B, C and D is connected so that it operates
correctly only if one of the two conditions is satisfied:

■ modules A and D operate correctly,

■ module A operates correctly and either B or C operates correctly.

Answer the following questions:

(a) Draw the reliability block diagram of the above system such that every
module appears only once.

(b) What is the reliability of the system? Assume that the reliability of the
module X is $R(X)$ and that the modules fail independently from each
other.

(c) What is the reliability of the above system as a function of time $t$, if
the failure rates of the components are $\lambda_A = \lambda_B = \lambda_C = \lambda_D = 0.01$ per
year? Assume that the exponential failure law holds.

3.27. How many states has a non-simplified state transition diagram of a system
consisting of $n$ components? Assume that every component has only two
states: operational and failed.

3.28. Construct the Markov model of the three-component system shown in Figure
3.5. Assume that the components are independent and non-repairable. The
failure rate of the processors 1 and 2 is $\lambda_p$. The failure rate of the memory
is $\lambda_m$. Derive and solve the system of state transition equations representing
this system. Compute the reliability of the system.

3.29. Construct the Markov model of the three-component system shown in Figure
3.5 for the case when a failed processor can be repaired. Assume that the
components are independent and that a processor can be repaired as long as
the system has not failed. The failure rate of the processors 1 and 2 is $\lambda_p$.
The failure rate of the memory is $\lambda_m$. Derive and solve the system of state
transition equations representing this system. Compute the reliability of the
system.

3.30. What is the difference between treatment of repair for reliability and
availability analysis?

3.31. Construct the Markov model of the three-component system shown in Figure
3.5 for availability analysis. Assume that the components are independent
and that the processors and the memory can be repaired after the system
failure. The failure rate of the processors 1 and 2 is $\lambda_p$. The failure rate of
the memory is $\lambda_m$. Derive and solve the system of state transition equations
representing this system. Compute the reliability of the system.

3.32. Suppose that the reliability of a system consisting of 4 blocks, two of which
are identical, is given by the following equation:

$R_{system} = R_1 R_2 R_3 +$

Draw the reliability block diagram representing the system. 

3.33. Construct a Markov chain and write a transition matrix for self-purging
redundancy with 3 modules, for the cases listed below. For all cases, assume
that the components' failures are independent events and that the failure rate
of each module is $\lambda$. For all cases, simplify the state transition diagrams as
much as you can.

(a) Do reliability evaluation, assuming that the voter and switches are
perfect and no repairs are allowed.

(b) Do reliability evaluation, assuming that the voter can fail with the failure
rate $\lambda_v$, the switches are perfect, and no repairs are allowed.

(c) Do reliability evaluation, assuming that the voter and switches are
perfect and repairs are allowed. Assume that each module has its own
repair crew (i.e. that the components' repairs are independent events)
and that the repair rate of each module is $\mu$.

(d) Do availability evaluation, assuming that the voter and switches are
perfect and repairs are allowed. Assume that there is a single repair
crew for all modules that can handle only one module at a time and that
the repair rate of each module is $\mu$.


HARDWARE REDUNDANCY


Those parts of the system that you can hit with a hammer (not advised) are called hardware; 
those program instructions that you can only curse at are called software. 

—Anonymous 


1. Introduction 

Hardware redundancy is achieved by providing two or more physical
instances of a hardware component. For example, a system can include redundant
processors, memories, buses or power supplies. Hardware redundancy is often
the only available method for improving the dependability of a system, when
other techniques, such as better components, design simplification,
manufacturing quality control, software debugging, have been exhausted or shown to
be more costly than redundancy.

Originally, hardware redundancy techniques were used to cope with the low
reliability of individual hardware elements. Designers of early computing
systems replicated components at gate and flip-flop levels and used comparison
or voting to detect or correct faults. As reliability of hardware components
improved, the redundancy was shifted to the level of larger components, such
as whole memories or arithmetic units, thus reducing the relative complexity
of the voter or comparator with respect to that of redundant units.

There are three basic forms of hardware redundancy: passive, active and
hybrid. Passive redundancy achieves fault tolerance by masking the faults that
occur without requiring any action on the part of the system or an operator.
Active redundancy requires a fault to be detected before it can be tolerated.
After the detection of the fault, the actions of location, containment and
recovery are performed to remove the faulty component from the system. Hybrid
redundancy combines passive and active approaches. Fault masking is used to
prevent generation of erroneous results. Fault detection, location and recovery
are used to replace the faulty component with a spare component.

Hardware redundancy brings a number of penalties: increase in weight, size, 
power consumption, as well as time to design, fabricate, and test. A number of 
choices need to be examined to determine the best way to incorporate redundancy
into the system. For example, weight increase can be reduced by applying 
redundancy to the lower-level components. Cost increase can be minimized 
if the expected improvement in dependability reduces the cost of preventive 
maintenance for the system. In this section, we examine a number of different 
redundancy configurations and calculate the effect of redundancy on system 
dependability. We also discuss the problem of common-mode failures which 
are caused by faults occurring in a part of the system common to all redundant 
components. 

2. Redundancy allocation 

The use of redundancy does not immediately guarantee an improvement in
the dependability of a system. The increase in weight, size and power
consumption caused by redundancy can be quite severe. The increase in complexity may
diminish the dependability improvement, unless a careful analysis is performed
to show that a more dependable system can be obtained by allocating the
redundant resources in a proper way.

A number of possibilities have to be examined to determine at which level
redundancy needs to be provided and which components need to be made
redundant. To understand the importance of these decisions, consider a serial
system consisting of two components with reliabilities $R_1$ and $R_2$. If the system
reliability $R = R_1 R_2$ does not satisfy the design requirements, the designer may
decide to make some of the components redundant. The possible choices of
redundant configurations are shown in Figure 4.1(a) and (b). Assuming the
component failures are mutually independent, the corresponding reliabilities of
these systems are

$$R_a = (2R_1 - R_1^2) R_2$$

$$R_b = (2R_2 - R_2^2) R_1$$



Figure 4.1. Redundancy allocation, configurations (a) and (b).

Taking the difference of $R_a$ and $R_b$, we get

$$R_a - R_b = R_1 R_2 (R_2 - R_1)$$

It follows from this expression that the higher reliability is achieved if we
duplicate the component that is least reliable. If $R_1 < R_2$, then the configuration
in Figure 4.1(a) is preferable, and vice versa.
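
The allocation rule can be checked numerically. The sketch below assumes independent failures, as in the text; the function name `allocation_reliabilities` and the values 0.8 and 0.9 are illustrative choices only.

```python
def allocation_reliabilities(r1, r2):
    """Return (Ra, Rb) for a two-component serial system.

    Ra: component 1 is duplicated (Figure 4.1(a)).
    Rb: component 2 is duplicated (Figure 4.1(b)).
    """
    ra = (2 * r1 - r1 ** 2) * r2
    rb = (2 * r2 - r2 ** 2) * r1
    return ra, rb


# Illustrative reliabilities: component 1 is the less reliable one.
r1, r2 = 0.8, 0.9
ra, rb = allocation_reliabilities(r1, r2)
print(f"Ra = {ra:.4f}, Rb = {rb:.4f}, Ra - Rb = {ra - rb:.4f}")
print(f"R1*R2*(R2 - R1) = {r1 * r2 * (r2 - r1):.4f}")
```

With these values, duplicating the weaker component gives the higher system reliability, and the difference agrees with the closed-form expression.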

Another important parameter to examine is the level of redundancy. Consider
a system consisting of three serial components. In high-level redundancy, the
entire system is duplicated, as shown in Figure 4.2(a). In low-level redundancy,
the duplication takes place at component level, as shown in Figure 4.2(b). If
each block of the diagram is a subsystem, the redundancy can be placed
at even lower levels.



Figure 4.2. High-level (a) and low-level (b) redundancy.


Let us compare the reliabilities of the systems in Figures 4.2(a) and (b).
Assuming that the component failures are mutually independent, we have

$$R_a = 1 - (1 - R_1 R_2 R_3)^2$$

$$R_b = (1 - (1 - R_1)^2)(1 - (1 - R_2)^2)(1 - (1 - R_3)^2)$$

The system in Figure 4.2(a) is a parallel combination of two serial sub-systems.
The system in Figure 4.2(b) is a serial combination of three parallel sub-systems.
As we can see, the reliabilities $R_a$ and $R_b$ differ, although the systems have the
same number of components. If $R_1 = R_2 = R_3 = R$, then the difference is

$$R_b - R_a = 6R^3(1 - R)^2$$

Consequently, $R_b > R_a$ for $0 < R < 1$, i.e. low-level redundancy yields higher
reliability than high-level redundancy. However, this dependency only holds if the
component failures are truly independent in both configurations. In reality, common-mode
failures are more likely to occur with low-level rather than with high-level
redundancy, since in high-level redundancy the redundant units are normally
isolated physically and therefore are less prone to common sources of stress.
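
The comparison of the two redundancy levels can be reproduced with a few lines of code. This is a minimal sketch assuming three identical, independently failing components; the function names are hypothetical.

```python
def high_level(r):
    """Two parallel copies of a three-component serial system (Figure 4.2(a))."""
    return 1 - (1 - r ** 3) ** 2


def low_level(r):
    """Three duplicated components connected in series (Figure 4.2(b))."""
    return (1 - (1 - r) ** 2) ** 3


for r in (0.5, 0.9, 0.99):
    diff = low_level(r) - high_level(r)
    closed_form = 6 * r ** 3 * (1 - r) ** 2
    print(f"R={r}: Ra={high_level(r):.6f}  Rb={low_level(r):.6f}  "
          f"Rb-Ra={diff:.6f}  6R^3(1-R)^2={closed_form:.6f}")
```

The printed difference matches the closed form for every value tried. The sketch does not model common-mode failures, so it says nothing about the practical caveat raised above.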

3. Passive redundancy 

The passive redundancy approach masks faults rather than detecting them. Masking
ensures that only correct values are passed to the system output in spite of the
presence of a fault. In this section we first study the concept of triple modular
redundancy, and then extend it to a more general case of N-modular redundancy.

3.1 Triple modular redundancy 

The most common form of passive hardware redundancy is triple modular 
redundancy (TMR). The basic configuration is shown in Figure 4.3. The
components are triplicated to perform the same computation in parallel. Majority
voting is used to determine the correct result. If one of the modules fails, the 
majority voter will mask the fault by recognizing as correct the result of the 
remaining two fault-free modules. Depending on the application, the triplicated 
modules can be processors, memories, disk drives, buses, network connections, 
power supplies, etc. 


Figure 4.3. Triple modular redundancy.


A TMR system can mask only one module fault. A failure in either of the 
remaining modules would cause the voter to produce an erroneous result. In 
Section 5 we will show that the dependability of a TMR system can be improved 
by removing failed modules from the system. 

TMR is usually used in applications where a substantial increase in reliability
is required for a short period. For example, TMR was used in the logic section
of the launch vehicle digital computer (LVDC) of Saturn 5, the rocket that
carried the Apollo spacecraft into orbit. The functions of the LVDC include the
monitoring, testing and diagnosis of rocket systems to detect possible failures or
unsafe conditions. As a result of using TMR, the reliability of the logic section
for a 250-hr mission is approximately twenty times larger than the reliability
of an equivalent simplex system. However, as we will see in the next section, for
longer duration missions, a TMR system is less reliable than a simplex system.

3.1.1 Reliability evaluation 

The fact that a TMR system can mask one module fault does not
immediately imply that the reliability of a TMR system is higher than the
reliability of a simplex system. To estimate the influence of TMR on reliability,
we need to take the reliability of the modules as well as the duration of the mission
into account.

A TMR system operates correctly as long as at least two modules operate correctly.
Assuming that the voter is perfect and that the component failures are mutually
independent, the reliability of a TMR system is given by

$$R_{TMR} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$$

The term $R_1 R_2 R_3$ gives the probability that the first module functions correctly
and the second module functions correctly and the third module functions
correctly. The term $(1 - R_1) R_2 R_3$ stands for the probability that the first module
has failed and the second module functions correctly and the third module
functions correctly, etc. The overall probability is the sum of the probabilities of the
terms since the events are mutually exclusive. If $R_1 = R_2 = R_3 = R$, the above
equation reduces to

$$R_{TMR} = 3R^2 - 2R^3 \tag{4.1}$$

Figure 4.4 compares the reliability of a TMR system $R_{TMR}$ to the reliability
of a simplex system consisting of a single module with reliability $R$. The
reliabilities of the modules composing the TMR system are assumed to be
equal to $R$. As can be seen, there is a point at which $R_{TMR} = R$. This point can
be found by solving the equation $3R^2 - 2R^3 = R$. The three solutions are 0.5, 1
and 0, implying that the reliability of a TMR system is equal to the reliability
of a simplex system when the reliability of the module is $R = 0.5$, when the
module is perfect ($R = 1$), or when the module is failed ($R = 0$). For
$0.5 < R < 1$ the TMR system is more reliable than the simplex system; for
$0 < R < 0.5$ it is less reliable.

Figure 4.4. TMR reliability compared to simplex system reliability.

This further illustrates a difference between fault tolerance and reliability. A
system can be fault-tolerant and still have a low overall reliability. For example,
a TMR system built out of poor-quality modules with $R = 0.2$ will have a low
reliability of $R_{TMR} = 3(0.2)^2 - 2(0.2)^3 = 0.104$. Vice versa, a system which cannot
tolerate any faults can have a high reliability, e.g. when its components are highly
reliable. However, such a system will fail as soon as the first fault occurs.
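
Equation (4.1) is easy to tabulate against the simplex reliability. The sketch below assumes a perfect voter and identical, independent modules, as in the derivation; `r_tmr` is a hypothetical helper name and the sample values of R are arbitrary.

```python
def r_tmr(r):
    """Reliability of a TMR system with a perfect voter, equation (4.1)."""
    return 3 * r ** 2 - 2 * r ** 3


for r in (0.3, 0.5, 0.9):
    verdict = ("better than" if r_tmr(r) > r + 1e-12
               else "worse than" if r_tmr(r) < r - 1e-12
               else "equal to")
    print(f"R = {r}: R_TMR = {r_tmr(r):.4f} ({verdict} simplex)")
```

Modules with R below 0.5 make the triplicated system worse than a single module, while modules with R above 0.5 make it better.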

Next, let us consider how the reliability of a TMR system changes as a
function of time. For a constant failure rate $\lambda$, the reliability of the system
varies exponentially as a function of time, $R(t) = e^{-\lambda t}$ (3.9). Substituting this
expression in (4.1), we get

$$R_{TMR}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \tag{4.2}$$


Figure 4.5 shows how the reliabilities of simplex and TMR systems change as
functions of time. The value of $\lambda t$, rather than $t$, is shown on the x-axis, to make
the comparison independent of the failure rate. Recall that $1/\lambda = MTTF$ (3.12),
so that the point $\lambda t = 1$ corresponds to the time when the system is expected to
experience the first failure. One can see that the reliability of the TMR system
is higher than the reliability of the simplex system in the period between $\lambda t = 0$
and approximately $\lambda t = 0.7$. That is why TMR is suitable for applications whose
mission time is shorter than 0.7 of MTTF.

Figure 4.5. TMR reliability as a function of $\lambda t$.
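
The crossover point can be located exactly: the two curves meet where the module reliability equals 0.5, i.e. at $\lambda t = \ln 2 \approx 0.693$. The following sketch evaluates both curves on the normalized time axis; the function names are illustrative and a perfect voter is assumed.

```python
import math


def r_simplex(lt):
    """Simplex reliability as a function of normalized time lt = lambda * t."""
    return math.exp(-lt)


def r_tmr_t(lt):
    """TMR reliability as a function of normalized time, equation (4.2)."""
    return 3 * math.exp(-2 * lt) - 2 * math.exp(-3 * lt)


crossover = math.log(2)  # point where exp(-lt) = 0.5
print(f"crossover at lambda*t = {crossover:.3f}")
for lt in (0.1, 0.3, crossover, 1.0):
    print(f"lambda*t = {lt:.3f}: simplex = {r_simplex(lt):.4f}, TMR = {r_tmr_t(lt):.4f}")
```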

3.1.2 Voting techniques 

In the previous section we evaluated the reliability of a TMR system assuming
that the voter is perfect. Clearly, such an assumption is not realistic. A more
precise estimation of the reliability of a TMR system takes the reliability of the
voter into account:

$$R_{TMR} = (3R^2 - 2R^3) R_v$$

The voter is in series with the redundant modules, since if it fails, the whole
system fails. The reliability of the voter must be very high in order to keep
the overall reliability of the TMR system higher than the reliability of a
corresponding simplex system. Fortunately, the voter is typically a very simple
device compared to the redundant components and therefore its failure
probability is much smaller. Still, in some systems the presence of a single point
of failure, i.e. any component within a system whose failure leads to the failure
of the system, is not acceptable under qualitative requirement specifications.
In such cases, more complicated voting schemes are
used. One possibility is to decentralize voting by having three voters instead
of one, as shown in Figure 4.6. Decentralized voting avoids the single point of
failure, but requires establishing consensus among three voters.
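
The series expression for a TMR system with a voter gives a quick bound on how reliable a single voter has to be. The worked steps below are a sketch that assumes identical modules with reliability $R$ and a voter with reliability $R_v$; the value $R = 0.9$ is chosen only for illustration.

```latex
% Condition for TMR with a single voter to beat a simplex module
(3R^2 - 2R^3)\, R_v > R
\quad\Longleftrightarrow\quad
R_v > \frac{1}{3R - 2R^2}, \qquad 0 < R < 1 .

% Illustrative value R = 0.9
3(0.9) - 2(0.9)^2 = 2.7 - 1.62 = 1.08,
\qquad
R_v > \frac{1}{1.08} \approx 0.926 .

% The denominator 3R - 2R^2 is largest at R = 3/4, where it equals 9/8,
% so for any module reliability the voter must satisfy
R_v > \frac{8}{9} \approx 0.889 .
```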


Figure 4.6. TMR system with three voters.


Another possibility is the so-called master-slave approach that replaces a
failed voter with a standby voter.

Voting relies heavily on accurate timing. If values arrive at a voter at
different times, an incorrect voting result may be generated. Therefore, a reliable
time service should be provided throughout a TMR or NMR system. This can
be done either by using additional interval timers, or by implementing
asynchronous protocols that rely on the progress of computation to provide an
estimate of time. Multiple-processor systems should either provide a fault-tolerant
global clock service that maintains a consistent source of time throughout the
system, or resolve time conflicts on an ad-hoc basis.

Another problem with voting is that the values that arrive at a voter may
not completely agree, even in a fault-free case. For example, analog to digital
converters may produce values which slightly disagree. A common approach
to overcome this problem is to accept as correct the median value, i.e. the one
which lies between the remaining two. Another approach is to ignore several least
significant bits of information and to perform voting only on the remaining bits.
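
Both approaches to inexact agreement can be sketched in a few lines. The code below is only an illustration: the function names are hypothetical, the readings are made-up integers and floats, and returning `None` when no majority exists is an assumption about how a voter might report disagreement.

```python
from collections import Counter


def median_vote(a, b, c):
    """Return the median of three readings (the value between the other two)."""
    return sorted((a, b, c))[1]


def masked_majority_vote(values, ignored_bits):
    """Majority-vote integer readings after dropping the low-order bits.

    Returns the agreed truncated value, or None if no strict majority exists.
    """
    truncated = [v >> ignored_bits for v in values]
    value, count = Counter(truncated).most_common(1)[0]
    return value if count > len(values) // 2 else None


print(median_vote(1.02, 0.98, 1.00))   # readings that disagree slightly
print(median_vote(5, 100, 6))          # one grossly wrong reading is ignored
print(bin(masked_majority_vote([0b10110101, 0b10110110, 0b10110100], 2)))
```

Note that dropping low-order bits does not remove disagreement entirely: two readings that differ by one unit can still fall on opposite sides of a truncation boundary.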
Voting can be implemented in either hardware or software. Hardware voters
are usually quick enough to meet any response deadline. If voting is done by
software voters that must reach a consensus, adequate time may not be available.
A hardware majority voter with 3 inputs for digital data is shown in Figure 4.7.
The value of the output $f$ is determined by the majority of the input values
$x_1, x_2, x_3$. The defining table for this voter is given in Table 4.1.

Figure 4.7. Logic diagram of a majority voter with 3 inputs.


  x1   x2   x3  |  f
  0    0    0   |  0
  0    0    1   |  0
  0    1    0   |  0
  0    1    1   |  1
  1    0    0   |  0
  1    0    1   |  1
  1    1    0   |  1
  1    1    1   |  1

Table 4.1. Defining table for 2-out-of-3 majority voter.
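
The defining table can be reproduced with the standard sum-of-products form of the 2-out-of-3 majority function, $f = x_1 x_2 + x_1 x_3 + x_2 x_3$. This is a sketch; whether the gate-level circuit in the figure uses exactly this form is an assumption, and `majority3` is a hypothetical name.

```python
from itertools import product


def majority3(x1, x2, x3):
    """2-out-of-3 majority of three bits: f = x1*x2 + x1*x3 + x2*x3."""
    return (x1 & x2) | (x1 & x3) | (x2 & x3)


print("x1 x2 x3 | f")
for x1, x2, x3 in product((0, 1), repeat=3):
    print(f" {x1}  {x2}  {x3} | {majority3(x1, x2, x3)}")
```

For multi-bit words the same expression can be applied bitwise, since `&` and `|` operate on every bit position of Python integers independently.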


3.2 N-modular redundancy 

The N-modular redundancy (NMR) approach is based on the same principle as
TMR, but uses $n$ modules instead of three (Figure 4.8). The number $n$ is usually
selected to be odd, to make majority voting possible. An NMR system can mask
$\lfloor n/2 \rfloor$ module faults.

Figure 4.8. N-modular redundancy.

Figure 4.9 plots the reliabilities of NMR systems for $n = 1, 3, 5$ and 7. Note
that the x-axis shows the interval of time between 0 and $\lambda t = 1$, i.e. MTTF.
This interval of time is of most interest for reliability analysis. As expected,
larger values of $n$ result in a higher increase of reliability of the system. At time
approximately $\lambda t = 0.7$, the reliabilities of simplex, TMR, 5MR and 7MR systems
become equal. After $\lambda t = 0.7$, the reliability of a simplex system is higher than
the reliabilities of redundant systems. So, similarly to TMR, NMR is suitable
for applications with short mission times.
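
The reliability of an NMR system can be computed as a binomial sum over the number of working modules. The sketch below assumes a perfect voter, identical modules and independent failures; this formula is not spelled out in the text and the name `r_nmr` is hypothetical.

```python
from math import comb, exp


def r_nmr(n, r):
    """Probability that a majority (at least n//2 + 1) of n modules work."""
    need = n // 2 + 1
    return sum(comb(n, k) * r ** k * (1 - r) ** (n - k) for k in range(need, n + 1))


for lt in (0.1, 0.3, 0.693, 1.0):
    r = exp(-lt)  # module reliability at normalized time lambda*t
    row = ", ".join(f"n={n}: {r_nmr(n, r):.4f}" for n in (1, 3, 5, 7))
    print(f"lambda*t = {lt}: {row}")
```

For $n = 3$ the sum reduces to $3R^2 - 2R^3$, and all curves pass through 0.5 when the module reliability is 0.5, which is where the plotted curves cross.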



Figure 4.9. Reliability of an NMR system for different values of n. 


4. Active redundancy 

Active redundancy achieves fault tolerance by first detecting the faults which
occur and then performing actions needed to recover the system back to the
operational state. Active redundancy techniques are common in applications where
temporary erroneous results are preferable to the high degree of redundancy
required to achieve fault masking. Infrequent, occasional errors are allowed, as
long as the system recovers back to normal operation in a specified interval of
time.

In this section we consider three common active redundancy techniques:
duplication with comparison, standby sparing and pair-and-a-spare, and examine
the effect of redundancy on system dependability.

4.1 Duplication with comparison 

The basic form of active redundancy is duplication with comparison, shown
in Figure 4.10. Two identical modules perform the same computation in
parallel. The results of the computation are compared using a comparator. If the
results disagree, an error signal is generated. Depending on the application, the
duplicated modules can be processors, memories, I/O units, etc.


Figure 4.10. Duplication with comparison.


A duplication with comparison scheme can detect only one module fault. 
After the fault is detected, no actions are taken by the system to return back to 
the operational state. 

4.1.1 Reliability evaluation 

A duplication with comparison system functions correctly only as long as both
modules operate correctly. When the first fault occurs, the comparator detects
a disagreement and the normal functioning of the system stops, since the
comparator cannot distinguish which of the results is the correct one.
Assuming that the comparator is perfect and that the component failures are
mutually independent, the reliability of the system is given by

$$R_{DC} = R_1 R_2 \tag{4.3}$$

or $R_{DC} = R^2$ if $R_1 = R_2 = R$.

Figure 4.11 compares the reliability of a duplication with comparison system
$R_{DC}$ to the reliability of a simplex system consisting of a single module with
reliability $R$. It can be seen that, unless the modules are perfect ($R(t) = 1$),
the reliability of a duplication with comparison system is always smaller than
the reliability of a simplex system.

Figure 4.11. Duplication with comparison reliability compared to simplex system reliability.


4.2 Standby sparing 

Standby sparing is another scheme for active hardware redundancy. The
basic configuration is shown in Figure 4.12. Only one of $n$ modules is operational
and provides the system's output. The remaining $n - 1$ modules serve as spares.
A spare is a redundant component which is not needed for the normal system
operation. A switch is a device that monitors the active module and switches
operation to a spare if an error is reported by the fault-detection unit FD.


Figure 4.12. Standby sparing redundancy.


There are two types of standby sparing: hot standby and cold standby. In
hot standby sparing, both operational and spare modules are powered up.
The spares can be switched into use immediately after the operational module
has failed. In cold standby sparing, the spare modules are powered down
until needed to replace the faulty module. A disadvantage of cold standby
sparing is that time is needed to apply power to a module, perform initialization
and re-computation. An advantage is that the standby spares do not consume
power. This is important in applications like satellite systems, where power
consumption is critical. Hot standby sparing is preferable where the momentary
interruption of normal operation due to reconfiguration needs to be minimized,
like in a nuclear plant control system.

A standby sparing system with $n$ modules can tolerate $n - 1$ module faults.
Here by "tolerate" we mean that the system will detect and locate the faults,
successfully recover from them and continue delivering the correct service.
When the $n$th fault occurs, it will still be detected, but the system will not be
able to recover back to normal operation.

The standby sparing redundancy technique is used in many systems. One
example is the Apollo spacecraft's telescope mount pointing computer. In this
system, two identical computers, an active and a spare, are connected to a
switching device that monitors the active computer and switches operation to
the backup in case of a malfunction.

Another example of using standby sparing is the Saturn 5 launch vehicle digital
computer (LVDC) memory section. The section consists of two memory blocks,
with each memory being controlled by an independent buffer register and
parity-checked. Initially, only one buffer register output is used. When a parity error
is detected in the memory being used, operation immediately transfers to the
other memory. Both memories are then re-generated by the buffer register of
the "correct" memory, thus correcting possible transient faults.

Standby sparing is also used in Compaq's NonStop Himalaya server. The
system is composed of a cluster of processors working in parallel. Each
processor has its own memory and copy of the operating system. A primary process
and a backup process are run on separate processors. The backup process
mirrors all the information in the primary process and is able to instantly take over
in case of a primary processor failure.

4.2.1 Reliability evaluation 

By their nature, standby systems involve dependency between components,
since the spare units are held in reserve and only brought to operation in the
event the primary unit fails. Therefore, standby systems are best analyzed
using Markov models. We first consider an idealized case when the switching
mechanism is perfect. We also assume that the spare cannot fail while it is in the
standby mode. Later, we consider the possibility of failure during switching.

Perfect switching case

Consider a standby sparing scheme with one spare, shown in Figure 4.13.
Let module 1 be a primary module and module 2 be a spare. The state
transition diagram of the system is shown in Figure 4.14. The states are numbered
according to Table 4.2.

Figure 4.13. Standby sparing system with one spare.


  Component 1   Component 2   State number
       O             O             1
       F             O             2
       F             F             3

Table 4.2. Markov states of the state transition diagram of a standby sparing system with one
spare.


When the primary component fails, there is a transition between state 1 and
state 2. If a system is in state 2 and the spare fails, there is a transition to
state 3. Since we assumed that the spare cannot fail while in standby mode,
the combination $(O, F)$ cannot occur. The states 1 and 2 are operational states.
The state 3 is the failed state.



Figure 4.14. State transition diagram of a standby sparing system with one spare. 


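The transition matrix for the state transition diagram in Figure 4.14 is given by

$$M = \begin{bmatrix} -\lambda_1 & 0 & 0 \\ \lambda_1 & -\lambda_2 & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix}$$

So, we get the following system of state transition equations

$$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix} = \begin{bmatrix} -\lambda_1 & 0 & 0 \\ \lambda_1 & -\lambda_2 & 0 \\ 0 & \lambda_2 & 0 \end{bmatrix} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}$$

or

$$\begin{cases} \frac{d}{dt} P_1(t) = -\lambda_1 P_1(t) \\ \frac{d}{dt} P_2(t) = \lambda_1 P_1(t) - \lambda_2 P_2(t) \\ \frac{d}{dt} P_3(t) = \lambda_2 P_2(t) \end{cases}$$

By solving the system of equations, we get

$$P_1(t) = e^{-\lambda_1 t}$$

$$P_2(t) = \frac{\lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$

$$P_3(t) = 1 - \frac{1}{\lambda_2 - \lambda_1} \left( \lambda_2 e^{-\lambda_1 t} - \lambda_1 e^{-\lambda_2 t} \right)$$

Since $P_3(t)$ is the only state corresponding to system failure, the reliability of
the system is the sum of $P_1(t)$ and $P_2(t)$

$$R_{SS}(t) = e^{-\lambda_1 t} + \frac{\lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$

This can be re-written as

$$R_{SS}(t) = e^{-\lambda_1 t} + \frac{\lambda_1}{\lambda_2 - \lambda_1} e^{-\lambda_1 t} \left( 1 - e^{-(\lambda_2 - \lambda_1) t} \right) \tag{4.4}$$

Assuming $(\lambda_2 - \lambda_1) t \ll 1$, we can expand the term $e^{-(\lambda_2 - \lambda_1) t}$ as a power series
of $-(\lambda_2 - \lambda_1) t$ as

$$e^{-(\lambda_2 - \lambda_1) t} = 1 - (\lambda_2 - \lambda_1) t + \tfrac{1}{2} (\lambda_2 - \lambda_1)^2 t^2 - \dots$$

Substituting it in (4.4), we get

$$R_{SS}(t) = e^{-\lambda_1 t} + \lambda_1 e^{-\lambda_1 t} \left( t - \tfrac{1}{2} (\lambda_2 - \lambda_1) t^2 + \dots \right)$$

Assuming $\lambda_2 = \lambda_1 = \lambda$, the above can be simplified to

$$R_{SS}(t) = (1 + \lambda t) e^{-\lambda t} \tag{4.5}$$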
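
The closed-form result (4.5) can be cross-checked by integrating the state transition equations numerically. The sketch below uses a plain forward Euler scheme, chosen here only for simplicity; the function names, the step count and the values of the failure rate and mission time are illustrative assumptions.

```python
import math


def standby_probs(lam1, lam2, t, steps=20000):
    """Integrate the three state equations of the perfect-switch model.

    Starts in state 1 (primary operational) and returns (P1, P2, P3) at time t.
    """
    p1, p2, p3 = 1.0, 0.0, 0.0
    dt = t / steps
    for _ in range(steps):
        d1 = -lam1 * p1                 # primary fails
        d2 = lam1 * p1 - lam2 * p2      # spare takes over, then may fail
        d3 = lam2 * p2                  # spare failure: system failed
        p1, p2, p3 = p1 + dt * d1, p2 + dt * d2, p3 + dt * d3
    return p1, p2, p3


def r_ss_closed(lam, t):
    """Closed-form reliability for equal failure rates, equation (4.5)."""
    return (1 + lam * t) * math.exp(-lam * t)


lam, t = 0.01, 50.0   # illustrative: lambda*t = 0.5
p1, p2, p3 = standby_probs(lam, lam, t)
print(f"numerical  R_SS = {p1 + p2:.6f}")
print(f"closed form R_SS = {r_ss_closed(lam, t):.6f}")
```

The numerical and closed-form values agree to several decimal places; a smaller step or a higher-order integrator would tighten the agreement further.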


Next, let us see how equation (4.5) would change if we ignored the
dependency between the failures. If the primary and spare module failures are
treated as mutually independent, the reliability of a standby sparing system is
a sum of two probabilities:

1. The probability that module 1 operates correctly, and

2. The probability that module 2 operates correctly, while module 1 has failed
and has been replaced by module 2.

Then, we get the following expression:

$$R_{SS} = R_1 + (1 - R_1) R_2$$

If $R_1 = R_2 = R$, then

$$R_{SS} = 2R - R^2$$

or

$$R_{SS}(t) = 2e^{-\lambda t} - e^{-2\lambda t} \tag{4.6}$$

Figure 4.15 compares the plots of the reliabilities (4.5) and (4.6). One can
see that neglecting the dependencies between failures leads to underestimating
the standby sparing system reliability.
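
The size of the underestimate can be tabulated directly from the two closed-form expressions. This is a small sketch with hypothetical function names, evaluated on the normalized time axis $\lambda t$.

```python
import math


def r_markov(lt):
    """Standby sparing reliability from the Markov model, equation (4.5)."""
    return (1 + lt) * math.exp(-lt)


def r_independent(lt):
    """Reliability obtained when failures are treated as independent, (4.6)."""
    return 2 * math.exp(-lt) - math.exp(-2 * lt)


for lt in (0.1, 0.5, 1.0, 2.0):
    gap = r_markov(lt) - r_independent(lt)
    print(f"lambda*t = {lt}: Markov = {r_markov(lt):.4f}, "
          f"independent = {r_independent(lt):.4f}, gap = {gap:.4f}")
```

The gap equals $e^{-\lambda t}(\lambda t - 1 + e^{-\lambda t})$, which is positive for every $\lambda t > 0$, so the independence assumption is always pessimistic here.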



Figure 4.15. Standby sparing reliability compared to simplex system reliability. Curve labels: $2e^{-\lambda t} - e^{-2\lambda t}$ and $(1 + \lambda t)e^{-\lambda t}$.


Non-perfect switching case 

Next, we consider the case when the switch is not perfect. Suppose that the
probability that the switch successfully replaces the primary unit by a spare is $p$.
Then, the probability that the switch fails to do it is $1 - p$. The state transition
diagram with these assumptions is shown in Figure 4.16. The transition from
state 1 is partitioned into two transitions. The failure rate is multiplied by $p$ to
get the rate of successful transition to state 2. The failure rate is multiplied by
$1 - p$ to get the rate of the switch failure.

Figure 4.16. State transition diagram of a standby system with one spare.

The state transition equations corresponding to the state transition diagram
in Figure 4.16 are

$$\begin{cases} \frac{d}{dt} P_1(t) = -\lambda_1 P_1(t) \\ \frac{d}{dt} P_2(t) = p \lambda_1 P_1(t) - \lambda_2 P_2(t) \\ \frac{d}{dt} P_3(t) = \lambda_2 P_2(t) + (1 - p) \lambda_1 P_1(t) \end{cases}$$

By solving this system of equations, we get

$$P_1(t) = e^{-\lambda_1 t}$$

$$P_2(t) = \frac{p \lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$

$$P_3(t) = 1 - e^{-\lambda_1 t} - \frac{p \lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$

As before, $P_3(t)$ corresponds to system failure. So, the reliability of the system
is the sum of $P_1(t)$ and $P_2(t)$

$$R_{SS}(t) = e^{-\lambda_1 t} + \frac{p \lambda_1}{\lambda_2 - \lambda_1} \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)$$

Assuming $\lambda_2 = \lambda_1 = \lambda$, the above can be simplified to

$$R_{SS}(t) = (1 + p \lambda t) e^{-\lambda t} \tag{4.7}$$


Figure 4.17 compares the reliability of a standby sparing system for different
values of $p$. As $p$ decreases, the reliability of the standby sparing system
decreases. When $p$ reaches zero, the standby sparing system reliability reduces
to the reliability of a simplex system.
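
The effect of the switch success probability can be seen by evaluating equation (4.7) for a few values of $p$. The sketch assumes equal failure rates for the primary and the spare; the function name and the chosen values of $p$ and $\lambda t$ are illustrative.

```python
import math


def r_ss_imperfect(p, lt):
    """Standby sparing reliability with switch success probability p, eq. (4.7)."""
    return (1 + p * lt) * math.exp(-lt)


lt = 0.5  # normalized mission time lambda*t
for p in (1.0, 0.9, 0.5, 0.0):
    print(f"p = {p}: R_SS = {r_ss_imperfect(p, lt):.4f}")
print(f"simplex:  R    = {math.exp(-lt):.4f}")
```

With $p = 1$ the expression coincides with the perfect-switch result, and with $p = 0$ it coincides with the simplex reliability.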

Figure 4.17. Reliability of a standby sparing system for different values of $p$.

4.3 Pair-and-a-spare

The pair-and-a-spare technique combines the standby sparing and duplication
with comparison approaches (Figure 4.18). The idea is similar to standby sparing,
however two modules instead of one are operated in parallel. As in the
duplication with comparison case, the results are compared to detect disagreement. If
an error signal is received from the comparator, the switch analyzes the report
from the fault detection block and decides which of the two modules' output is
faulty. The faulty module is removed from operation and replaced with a spare
module.


Figure 4.18. Pair-and-a-spare redundancy.


A pair-and-a-spare system with $n$ modules can tolerate $n - 1$ module faults.
When the $(n - 1)$th fault occurs, it will be detected and located by the switch and
the correct result will be passed to the system's output. However, since there
will be no more spares available, the switch will not be able to replace the faulty
module with a spare module. The system's configuration will be reduced to a
simplex system with one module. So, the $n$th fault will not be detected.

5. Hybrid redundancy

The main idea of hybrid redundancy is to combine the attractive features
of the passive and active approaches. Fault masking is used to prevent the system from
producing momentary erroneous results. Fault detection, location and recovery
are used to reconfigure the system after a fault occurs. In this section, we
consider three basic techniques for hybrid redundancy: self-purging redundancy,
N-modular redundancy with spares and triplex-duplex redundancy.

5.1 Self-purging redundancy 

Self-purging redundancy consists of $n$ identical modules which are actively
participating in voting (Figure 4.19). The output of the voter is compared to
the outputs of individual modules to detect disagreement. If a disagreement
occurs, the switch opens and removes, or purges, the faulty module from the
system. The voter is designed as a threshold gate, capable of adapting to the
changing number of inputs. The input of the removed module is forced to zero
and therefore does not contribute to the voting.



Figure 4.19. Self-purging redundancy. 


A self-purging redundancy system with n modules can mask n - 2 module
faults. When n - 2 modules are purged and only two are left, the system will be
able to detect the next, (n - 1)th fault, but, as in the duplication with comparison
case, the voter will not be able to distinguish which one of the two results is the
correct one.

5.1.1 Reliability evaluation 

Since all the modules of the system operate in parallel, we can assume that
the modules’ failures are mutually independent. It is sufficient that two of the
modules of the system function correctly for the system to be operational. If
the voter and the switches are perfect, and if all the modules have the same
reliability $R_1 = R_2 = \dots = R_n = R$, then the system is not reliable if all the
modules have failed (probability $(1 - R)^n$), or if all but one of the modules have
failed (probability $R(1 - R)^{n-1}$). Since there are n choices for one of n modules to
remain operational, we get the equation

$$R_{SP} = 1 - \left( (1 - R)^n + nR(1 - R)^{n-1} \right) \qquad (4.8)$$

Figure 4.20 compares the reliabilities of self-purging redundancy systems
with three, five and seven modules.



Figure 4.20. Reliability of a self-purging redundancy system with 3, 5 and 7 modules. 
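
The comparison in Figure 4.20 can be reproduced numerically. The following Python sketch evaluates the expression of equation (4.8), one minus the probability that all modules, or all but one, have failed. The function name and the sample module reliability R = 0.9 are illustrative assumptions, not values taken from the text, and a perfect voter and perfect switches are assumed as above.

```python
def self_purging_reliability(r: float, n: int) -> float:
    """Reliability of a self-purging system with n modules, equation (4.8).

    Assumes independent module failures, identical module reliability r,
    and a perfect voter and perfect switches.
    """
    if n < 2:
        raise ValueError("at least two modules are needed")
    all_failed = (1 - r) ** n                        # no module works
    all_but_one_failed = n * r * (1 - r) ** (n - 1)  # exactly one module works
    return 1 - (all_failed + all_but_one_failed)


if __name__ == "__main__":
    # R = 0.9 is an arbitrary sample value chosen for illustration.
    for n in (3, 5, 7):
        print(n, self_purging_reliability(0.9, n))
```

For n = 3 the expression reduces to $3R^2 - 2R^3$, the reliability of a TMR system with a perfect voter.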


5.2 N-modular redundancy with spares 

N-modular redundancy with k spares is similar to self-purging redundancy
with k + n modules, except that only n modules provide input to a majority
voter (Figure 4.21). The additional k modules serve as spares. If one of the primary
modules becomes faulty, the voter will mask the erroneous result and the switch
will replace the faulty module with a spare one. Various techniques are used
to identify faulty modules. One approach is to compare the output of the voter
with the individual outputs of the modules, as shown in Figure 4.21. A module
which disagrees with the majority is declared faulty.

The fault-tolerant capabilities of an N-modular redundancy system with k
spares depend on the form of voting used as well as the implementation of the
switch and comparator. One possibility is that, after the spares are exhausted,
the disagreement detector is switched off and the system continues working
as a passive NMR system. Then, such a system can mask $\lfloor n/2 \rfloor + k$ faults,



Figure 4.21. N-modular redundancy with spares. 


i.e. the number of faults an NMR system can mask plus the number of spares.
Another possibility is that the disagreement detector remains on, but the voter is
designed to adjust to the decreasing number of inputs. In this case,
the behavior of the system is similar to the behavior of a self-purging system
with n + k modules, i.e. up to k + n - 2 module faults can be masked. Suppose
the spares are exhausted after the first k faults, and the (k + 1)th fault occurs.
As before, the erroneous result will be masked by the voter, the output of the
voter will be compared to the individual outputs of the modules, and the faulty
module will be removed from consideration. The difference is that it will not be
replaced with a spare one; instead the system will continue working as an
(n - 1)-modular system. When the (k + i)th fault occurs, the voter votes on n - i modules.
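
The two possibilities described above give different bounds on the number of faults that can be masked. The following Python sketch merely tabulates these two bounds; the function name and the `adaptive_voter` flag are hypothetical names introduced for illustration.

```python
def nmr_with_spares_maskable_faults(n: int, k: int, adaptive_voter: bool) -> int:
    """Number of module faults an NMR system with k spares can mask.

    adaptive_voter=False: the disagreement detector is switched off once
    the spares are exhausted, so the system ends as a passive NMR system.
    adaptive_voter=True: the voter adjusts to a decreasing number of inputs,
    so the system behaves like a self-purging system with n + k modules.
    """
    if adaptive_voter:
        return n + k - 2
    return n // 2 + k


if __name__ == "__main__":
    for n, k in ((3, 1), (5, 2)):
        print(n, k,
              nmr_with_spares_maskable_faults(n, k, adaptive_voter=False),
              nmr_with_spares_maskable_faults(n, k, adaptive_voter=True))
```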

5.3 Triplex-duplex redundancy 

Triplex-duplex redundancy combines triple modular redundancy and duplication
with comparison (Figure 4.22). A total of six identical modules, grouped
in three pairs, are computing in parallel. In each pair, the results of the
computation are compared using a comparator. If the results agree, the output of
the comparator participates in the voting. Otherwise, the pair of modules is
declared faulty and the switch removes the pair from the system. In this way,
only fault-free pairs participate in voting.




Figure 4.22. Triplex-duplex redundancy. 
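
The compare-then-vote behavior of the triplex-duplex scheme can be sketched in software as follows. This is only an illustration under stated assumptions: the function name is hypothetical, module outputs are modeled as plain Python values, and returning `None` when no pair survives or when the surviving pairs are tied is our choice, since the text does not specify the behavior in these cases.

```python
from collections import Counter


def triplex_duplex_output(pairs):
    """Vote over the pairs whose two module outputs agree.

    pairs: sequence of three (a, b) tuples, one per duplex pair.
    A pair whose outputs disagree is treated as faulty and removed.
    Returns the majority value among the surviving pairs, or None if
    no pair survives or the vote is tied (an assumption of this sketch).
    """
    surviving = [a for a, b in pairs if a == b]
    if not surviving:
        return None
    ranked = Counter(surviving).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between surviving pairs
    return ranked[0][0]


if __name__ == "__main__":
    # The third pair disagrees internally and is removed before voting.
    print(triplex_duplex_output([(1, 1), (1, 1), (0, 1)]))
```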


6. Problems 

4.1. Explain the difference between passive, active and hybrid hardware
redundancy. Discuss the advantages and disadvantages of each approach.

4.2. Suppose that in the system shown in Figure 4.1 the two components have 
the same cost and $R_1 = 0.75$, $R_2 = 0.96$. If it is permissible to add two
components to the system, would it be preferable to replace component 1 by 
a three-component parallel system, or to replace components 1 and 2 each 
by two-component parallel systems? 

4.3. A disk drive has a constant failure rate and an MTTF of 5500 hr. 

(a) What is the probability of failure for one year of operation? 

(b) What is the probability of failure for one year of operation if two of the 
drives are placed in parallel and the failures are independent? 

4.4. Construct the Markov model of the TMR system with three voters shown in 
Figure 4.6. Assume that the components are independent. The failure rate 
of the modules is $\lambda_m$. The failure rate of the voters is $\lambda_v$. Derive and solve
the system of state transition equations representing this system. Compute 
the reliability of the system. 

4.5. Draw a logic diagram of a majority voter with 5 inputs. 

4.6. Suppose the design life reliability of a standby system consisting of two 
identical units should be at least 0.97. If the MTTF of each unit is 6 months,
determine the design life time. Assume that the failures are independent
and ignore switching failures. 

4.7. An engineer designs a system consisting of two subsystems in series with
the reliabilities $R_1 = 0.99$ and $R_2 = 0.85$. The cost of the two subsystems
is approximately the same. The engineer decides to add two redundant 
components. Which of the following is the best to do: 

(a) Duplicate subsystems 1 and 2 in high-level redundancy (Figure 4.2(a)). 

(b) Duplicate subsystems 1 and 2 in low-level redundancy (Figure 4.2(b)). 

(c) Replace the second subsystem by a three-component parallel system. 

4.8. A computer with MTTF of 3000 hr is to operate continuously on a 500 hr
mission.

(a) Compute the computer’s mission reliability.

(b) Suppose two such computers are connected in a standby configuration. 
If there are no switching failures and no failures of the backup computer 
while in the standby mode, what is the system MTTF and the mission 
reliability? 

(c) What is the mission reliability if the probability of switching failure is
0.02?

4.9. A chemical process control system has a reliability of 0.97. Because this
reliability is considered too low, a redundant system of the same design is to
be installed. The design engineer should choose between a parallel and a
standby configuration. How small must the probability of switching failure
be for the standby configuration to be more reliable than the parallel
configuration? Assume that there are no failures of the backup system while in
the standby mode.

4.10. Give examples of applications where you would recommend using cold
standby sparing and hot standby sparing (two examples each). Justify your 
answer. 

4.11. Compare the MTTF of a standby spare system with 3 modules and a
pair-and-a-spare system with 3 modules, provided the failure rate of a single
module is 0.01 failures per hour. Assume the modules obey the exponential 
failure law. Ignore the switching failures and the dependency between the 
module’s failures. 

4.12. A basic non-redundant controller for a heart pacemaker consists of an analog
to digital (A/D) converter, a microprocessor and a digital to analog (D/A)
converter. Develop a design making the controller tolerant to any two component
faults (component here means A/D converter, microprocessor or D/A
converter). Show the block diagram of your design and explain why you
recommend it.

4.13. Construct the Markov model of a hybrid N-modular redundancy with 3 active
modules and one spare. Assume that the components are independent and
that the probability that the switch successfully replaces the failed module
by a spare is p.

4.14. How many faulty modules can you tolerate in:

(a) 5-modular passive redundancy?

(b) standby sparing redundancy with 5 modules?

(c) self-purging hybrid modular redundancy with 5 modules?

4.15. Design a switch for hybrid N-modular redundancy with 3 active modules
and 1 spare.

4.16. (a) Draw a diagram of the standby sparing active hardware redundancy
technique with 2 spares.

(b) Using Markov models, write an expression for the reliability of the
system you showed on the diagram for

(a) perfect switching case,

(b) non-perfect switching case.

(c) Calculate the reliabilities for (a) and (b) after 1000 hrs for the failure
rate $\lambda = 0.01$ per 100 hours.

4.17. Which redundancy would you recommend to combine with self-purging
hybrid hardware redundancy to distinguish between transient and permanent
faults? Briefly describe what would be the main benefit of such a
combination.

4.18. Calculate the MTTF of a 5-modular hardware redundancy system, provided
the failure rate of a single module is 0.001 failures per hour. Assume
the modules obey the exponential failure law. Compare the MTTF of the 
5-modular redundancy system with the MTTF of a 3-modular hardware 
redundancy system having the failure rate 0.01 failures per hour. 

4.19. Draw a simplified Markov model for the 5-modular hardware redundancy
scheme with failure rate $\lambda$. Explain which state of the system each of the
nodes in your chain represents.

Chapter 5

INFORMATION REDUNDANCY 


The major difference between a thing that might go wrong and a thing that cannot possibly
go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns 
out to be impossible to get at or repair. 

—Douglas Adams, "Mostly Harmless" 


1. Introduction 

In this chapter we study how fault-tolerance can be achieved by means of 
encoding. Encoding is a powerful technique which helps us to avoid unwanted
information changes during storage or transmission. Attaching special check 
bits to blocks of digital information enables special-purpose hardware to detect 
and correct a number of communication and storage faults, such as changes 
in single bits or changes to several adjacent bits. Parity code used for random 
access memories in computer systems is a common example of an application of 
encoding. Other examples are communication protocols that provide a variety 
of detection and correction options including the encoding of large blocks of 
data to withstand multiple faults and provisions for multiple retries in the case 
the error correcting facilities cannot cope with the faults. 

Coding theory originated in the late 1940s with two seminal works by
Hamming and Shannon. Hamming, working at Bell Laboratories in the USA,
was studying possibilities for protecting storage devices from the corruption of
a small number of bits by a code which would be more efficient than simple
repetition. He realized the need to consider sets of words, or codewords, where
every pair differs in a large number of bit positions. Hamming defined the notion
of distance between two words and observed that this was a metric, thus leading to
interesting properties. This distance is now called the Hamming distance. His first
attempt produced a code in which four data bits were followed by three check
bits, which allowed not only the detection but also the correction of a single error.
The repetition code would require nine check bits to achieve this. Hamming
published his results in 1950.

Slightly prior to Hamming’s publication, in 1948, Shannon, also at Bell Labs, 
wrote an article formulating the mathematics behind the theory of communication.
In this article, he developed probability and statistics to formalize the
notion of information. Then, he applied this notion to study how a sender can 
communicate efficiently over different media, or more generally, channels of 
communication to a receiver. The channels under consideration were of two 
different types: noiseless or noisy. In the former case, the goal is to compress 
the information at the sender’s end and to minimize the total number of symbols 
communicated while allowing the receiver to recover transmitted information 
correctly. The latter case, which is more important to the topic of this book,
considers a channel that alters the signal being sent by adding to it a noise. The 
goal in this case is to add some redundancy to the message being sent so that a 
few erroneous symbols at the receiver’s end still allow the receiver to recover the 
sender’s intended message. Shannon’s work showed, somewhat surprisingly, 
that the same underlying notions captured the rate at which one could communicate
over either class of channels. Shannon’s methods involved encoding
messages using long random strings, and the theory relied on the fact that long 
messages chosen at random tend to be far away from each other. Shannon had 
shown that it was possible to encode messages in such a way that the number 
of extra bits transmitted was as small as possible. 

Although Shannon’s and Hamming’s works were chronologically and technically
intertwined, both researchers seem to have regarded the other’s work as far removed
from their own. Shannon’s papers never explicitly refer to distance in their main
technical results. Hamming, in turn, does not mention the applicability of his
results to reliable computing. Both works, however, were immediately seen 
to be of monumental impact. Shannon’s results started driving the theory of 
communication and storage of information. This, in turn, became the primary 
motivation for much research in the theory of error-correcting codes. 

The value of error-correcting codes for transmitting information became 
immediately apparent. A wide variety of codes were constructed, achieving 
both economy of transmission and error-correction capacity. Between 1969 
and 1973 the NASA Mariner probes used a powerful Reed-Muller code capable
of correcting 7 errors out of 32 bits transmitted. The codewords consisted of 6
data bits and 26 check bits. The data was sent to Earth at a rate of over 16,000
bits per second. 

Another application of error-correcting codes came with the development of 
the compact disk (CD). To guard against scratches, cracks and similar damage,
two "interleaved" codes are used, which can correct up to 4,000 consecutive errors
(about 2.5 mm of track).

Code selection is usually guided by the types of errors required to be tolerated
and the overhead associated with each of the error detection techniques. For
example, error correction is a common level of protection for minicomputers
and mainframes whereas the cheaper error detection by parity code is more
common in microcomputers. For solid state disks storing a system’s critical,
non-recoverable files, the most popular codes are Hamming codes to correct
errors in main memory, and Reed-Solomon codes to correct errors in peripheral
devices such as tape and disk storage.

2. Fundamental notions 

In this section, we introduce the basic notions of coding theory. We assume 
that our data is in the form of strings of binary bits, 0 or 1. We also assume 
that the errors occur randomly and independently from each other, but at a 
predictable overall rate. 

2.1 Code 

A binary code of length n is a set of binary n-tuples satisfying some well-defined
set of rules. For example, an even parity code contains all n-tuples
that have an even number of 1s. The set $B^n = \{0,1\}^n$ of all possible $2^n$ binary
n-tuples is called the codespace.

A codeword is an element of the codespace satisfying the rules of the code.
To make error detection and error correction possible, codewords are chosen
to be a nonempty subset of all possible $2^n$ binary n-tuples. For example, a
parity code of length n has $2^{n-1}$ codewords, which is one half of all possible $2^n$
n-tuples. An n-tuple not satisfying the rules of the code is called a word.

The number of codewords in a code C is called the size of C, denoted by |C|.
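
These definitions can be made concrete with a few lines of Python. The sketch below enumerates the codespace for a given n and selects the codewords of the even parity code. The function names are ours, chosen for illustration, and n-tuples are represented as Python tuples of 0s and 1s.

```python
from itertools import product


def codespace(n):
    """All 2**n binary n-tuples."""
    return list(product((0, 1), repeat=n))


def even_parity_code(n):
    """Codewords of the even parity code: n-tuples with an even number of 1s."""
    return [w for w in codespace(n) if sum(w) % 2 == 0]


if __name__ == "__main__":
    code = even_parity_code(3)
    print(code)       # the codewords of length 3
    print(len(code))  # the size |C| of the code
```

For any n, the size of the resulting code is half the size of the codespace, in agreement with the count $2^{n-1}$ given above.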

2.2 Encoding 

Encoding is the process of computing a codeword for given data. An
encoder takes a binary k-tuple representing the data and converts it to a codeword
using the rules of the code. For example, to compute a codeword for an even
parity code, the parity of the data is first determined. If the parity is odd, a
1-bit is attached to the end of the k-tuple. Otherwise, a 0-bit is attached. The
difference n - k between the length n of the codeword and the length k of the
data gives the number of check bits which must be added to the data to do the
encoding. A separable code is a code in which the check bits can be clearly
separated from the data bits. The parity code is an example of a separable code.
A non-separable code is a code in which the check bits cannot be separated from
the data bits. A cyclic code is an example of a non-separable code.
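
The even parity encoder described above can be sketched as follows. The function name is hypothetical and the data is represented as a list of bits; this is an illustration of the rule, not a hardware description.

```python
def encode_even_parity(data):
    """Append one check bit so that the codeword has an even number of 1s.

    data: list of k bits (0 or 1). Returns a list of n = k + 1 bits.
    """
    check_bit = sum(data) % 2  # 1 if the parity of the data is odd, else 0
    return list(data) + [check_bit]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    codeword = encode_even_parity(data)
    print(codeword)
    print(len(codeword) - len(data))  # number of check bits, n - k
```

Since the check bit is simply appended, the data bits remain visible in the first k positions of the codeword, which is what makes the code separable.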

2.3 Information rate

To encode binary k-bit data, we need a code consisting of at least $2^k$ codewords,
since every data word should be assigned its own individual codeword
from C. Conversely, a code of size |C| encodes data of length $k \le \lfloor \log_2 |C| \rfloor$
bits. The ratio k/n is called the information rate of the code. The information
rate determines the redundancy of the code. For example, a repetition code
obtained by repeating the data three times has the information rate 1/3. Only
one out of three bits carries the message; the other two are redundant.

2.4 Decoding 

Decoding is the process of restoring data encoded in a given codeword. A 
decoder reads a codeword and recovers the original data using the rules of the 
code. For example, a decoder for a parity code truncates the codeword by one 
bit. 

Suppose that an error has occurred and a non-codeword is received by a 
decoder. A usual assumption in coding theory is that a pattern of errors that 
involves a small number of bits is more likely to occur than any pattern that 
involves a large number of bits. Therefore, to perform decoding, we search for 
a codeword which is “closest” to the received word. Such a technique is called 
maximum likelihood decoding. As a measure of distance between two binary 
n-tuples x and y we use the Hamming distance.

2.5 Hamming distance 

The Hamming distance between two binary n-tuples, x and y, denoted by
$\delta(x,y)$, is the number of bit positions in which the n-tuples differ. For example,
x = 0011 and y = 0101 differ in 2 bit positions, so $\delta(x,y) = 2$. The Hamming
distance gives us an estimate of how many bit errors have to occur to change x
into y.
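
In software, the Hamming distance is a one-line computation. The following Python sketch, with a function name of our choosing, accepts n-tuples given as strings or as sequences of bits.

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length n-tuples differ."""
    if len(x) != len(y):
        raise ValueError("n-tuples must have the same length")
    return sum(a != b for a, b in zip(x, y))


if __name__ == "__main__":
    print(hamming_distance("0011", "0101"))
```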

Hamming distance is a genuine metric on the codespace $B^n$. A metric is
a function that associates any two objects in a set with a number and that
preserves a number of properties of the distance with which we are familiar.
These properties are formulated in the following three axioms:

1 $\delta(x,y) = 0$ if and only if x = y.

2 $\delta(x,y) = \delta(y,x)$.

3 $\delta(x,y) + \delta(y,z) \ge \delta(x,z)$.

The metric properties of the Hamming distance allow us to use the geometry
of the codespace to reason about the codes. As an example, consider
the codespace presented by a three-dimensional cube shown in Figure 5.1.
Codewords {000,011,101,110} are marked with large solid dots. Adjacent
vertices differ by a single bit. It is easy to see that the Hamming distance satisfies
the metric properties listed above, e.g.
$\delta(000,011) + \delta(011,111) = 2 + 1 = 3 = \delta(000,111)$.

2.6 Code distance 

The code distance of a code C is the minimum Hamming distance between
any pair of distinct codewords of C. For example, the code distance of
a parity code equals two. The code distance determines the error detecting
and error correcting capabilities of a code. For instance, consider the code
{000,011,101,110} in Figure 5.1. The code distance of this code is two. Any
one-bit error in any codeword produces a word lying at distance one from the
affected codeword. Since all codewords are at distance two from each other,
the error will be detected.

As another example, consider the code {000, 111} shown in Figure 5.2. The 



Figure 5.2. Code {000, 111} in the codespace $B^3$.


codewords are marked with large solid dots. Suppose an error occurred in the
first bit of the codeword 000. The resulting word 100 is at distance one from
000 and at distance two from 111. Thus, we correct 100 to the codeword 000,
which is closest to 100 according to the Hamming distance.
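
The correction step used in this example, choosing the codeword closest to the received word, can be sketched in Python as follows. The function names are hypothetical. The sketch searches the whole code exhaustively, which is practical only for very small codes, and if several codewords are equally close it simply returns the first one found, which is an assumption of this illustration.

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length n-tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def nearest_codeword(word, code):
    """Return the codeword of `code` closest to `word` in Hamming distance.

    Ties are broken in favor of the codeword listed first (an assumption
    of this sketch).
    """
    return min(code, key=lambda c: hamming_distance(word, c))


if __name__ == "__main__":
    code = ["000", "111"]
    print(nearest_codeword("100", code))
```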

The code {000, 111} is a replication code, obtained by repeating the data 
three times. Only one of the bits of a codeword carries the data, the other 
two are redundant. By its nature, this redundancy is similar to TMR, but it 
is implemented in the information domain. In TMR, the voter compares the 
output values of the modules. In a replication code, a decoder analyzes the bits 
of the received word. In both cases, the majority of values of bits determines 
the decision. 

In general, to be able to correct s-bit errors, a code should have a code
distance of at least 2s + 1. To be able to detect s-bit errors, the code distance
should be at least s + 1.
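
For small codes, the code distance and the resulting detection and correction capabilities can be computed by brute force. The Python sketch below uses function names of our choosing and examines all pairs of codewords, so it is meant as an illustration for codes of the size used in this section, not as an efficient algorithm.

```python
from itertools import combinations


def hamming_distance(x, y):
    """Number of positions in which two equal-length n-tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def code_distance(code):
    """Minimum Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))


def capabilities(code):
    """Return (detectable, correctable) numbers of bit errors.

    A code distance d allows detection of up to d - 1 bit errors and
    correction of up to (d - 1) // 2 bit errors.
    """
    d = code_distance(code)
    return d - 1, (d - 1) // 2


if __name__ == "__main__":
    print(capabilities(["000", "011", "101", "110"]))
    print(capabilities(["000", "111"]))
```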

2.7 Code efficiency 

Throughout the chapter, we evaluate the efficiency of a code using the following
three criteria:

1 Number of bit errors a code can detect/correct, reflecting the fault tolerant
capabilities of the code.

2 Information rate k/n, reflecting the amount of information redundancy added.

3 Complexity of encoding and decoding schemes, reflecting the amount of
hardware, software and time redundancy added.

The first item in the list above is the most important. Ideally, we would like 
to have a code that is capable of correcting all errors. The second objective is 
an efficiency issue. We would rather not waste resources by exchanging data
at a very low rate. Easy encoding and decoding schemes are likely to have a
simple implementation in either hardware or software. They are also desirable 
for efficiency reasons. In general, the more errors that a code needs to correct 
per message digit, the less efficient the communication and usually the more 
complicated the encoding and decoding schemes. A good code balances these 
objectives. 

3. Parity codes 

Parity codes are the oldest family of codes. They were used to detect
errors in the calculations of the relay-based computers of the late 1940s.

The even (odd) parity code of length n is composed of all the binary n-tuples
that contain an even (odd) number of 1s. Any subset of n - 1 bits of a codeword
can be viewed as data bits, carrying the information, while the remaining nth bit
checks the parity of the codeword. Any single-bit error can be detected, since
the parity of the affected n-tuple will be odd (even) rather than even (odd). It is
not possible to locate the position of the erroneous bit. Thus, it is not possible
to correct it.

The most common application of parity is error-detection in memories of 
computer systems. A diagram of a memory protected by a parity code is shown 
in Figure 5.3. 





Figure 5.3. A memory protected by a parity code; PG = parity generator; PC = parity checker. 


Before being written into a memory, the data is encoded by computing its 
parity. In most computer systems, one parity bit per byte (8 bits) of data is 
computed. The generation of parity bits is done by a parity generator (PG) 
implemented as a tree of exclusive-OR (XOR) gates. Figure 5.4 shows a logic 
diagram of an even parity generator for 4-bit data $(d_0, d_1, d_2, d_3)$.



Figure 5.4. Logic diagram of a parity generator for 4-bit data $(d_0, d_1, d_2, d_3)$.


When data is written into memory, parity bits are written along with the 
corresponding bytes of data. For example, for a 32-bit word size, four parity 
bits are attached to data and a 36-bit codeword is stored in the memory. Some
systems, such as the Pentium processor, have a 64-bit wide memory data path. In this
case, eight parity bits are attached to data. The resulting codeword is 72 bits long.

When the data is read back from the memory, parity bits are re-computed 
and the result is compared to the previously stored parity bits. Re-computation 
of parity is done by a parity checker (PC). Figure 5.5 shows a logic diagram 




of an even parity checker for 4-bit data $(d_0, d_1, d_2, d_3)$. The logic diagram is
similar to that of a parity generator, except that one more XOR gate is added
to compare the re-computed parity bit to the previously stored parity bit. If the
parity bits disagree, the output of the XOR gate is 1. Otherwise, the output is 0.



Figure 5.5. Logic diagram of a parity checker for 4-bit data $(d_0, d_1, d_2, d_3)$.


Any computed parity bit that does not match the stored parity bit indicates
that there was at least one bit error in the corresponding byte of data, or in the 
parity bit itself. An error signal, called non-maskable interrupt, is sent to the 
CPU to indicate that the memory data is not valid and to instruct the processor 
to immediately halt. 

All operations related to the error detection (encoding, decoding, comparison)
are done by the memory control logic on the motherboard, in the chip set,
or, for some systems, in the CPU. The memory itself only stores the parity bits,
just as it stores the data bits. Therefore, parity checking does not slow down
the operation of the memory. The parity bit generation and checking is done
in parallel with the writing and reading of the memory using logic which
is much faster than the memory itself. Nothing in the system waits for a “no
error” signal from the parity checker. The system only performs an interrupt
when it finds an error.

Example 5.1. Suppose the data which is written in the memory is [0110110] 
and odd-parity code is used. Then the check bit 1 is stored along with the data 
to make the overall parity odd, i.e. the codeword [01101101]. Suppose that the 
codeword read out of the memory is [01111101]. The re-computed parity is 0. 
Because the re-computed parity disagrees with the stored parity, we know that
an error has occurred. 
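
Example 5.1 can be replayed in software. The Python sketch below mimics the XOR-based parity generator and checker for an odd parity code. The function names are hypothetical, and the codeword is modeled as a list of bits with the stored check bit in the last position.

```python
from functools import reduce
from operator import xor


def odd_parity_bit(data):
    """Check bit that makes the total number of 1s (data + check bit) odd."""
    # XOR of all data bits is 1 when the data parity is odd; invert it.
    return 1 ^ reduce(xor, data, 0)


def parity_error(codeword):
    """True if the stored check bit disagrees with the re-computed one."""
    *data, stored = codeword
    return odd_parity_bit(data) != stored


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 1, 0]
    written = data + [odd_parity_bit(data)]   # codeword written to memory
    read = [0, 1, 1, 1, 1, 1, 0, 1]           # codeword read back
    print(written, parity_error(written))
    print(read, parity_error(read))
```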

Parity can detect only single-bit errors and, more generally, any odd number of
bit errors. If an even number of bits are affected, the computed parity matches
the stored parity, and the erroneous data is accepted with no error notification,
possibly causing mysterious problems later. Studies have shown that approximately 98%
of all memory errors are single-bit errors. Thus, protecting a memory by a
parity code is an inexpensive and efficient technique. For example, a 1 GByte
dynamic random access memory (DRAM) with a parity code has a failure rate
of 0.7 failures per year. If the same memory uses a single-error correction
double-error detection Hamming code, requiring 7 check bits in a 32-bit wide
memory system, then the failure rate reduces to 0.03 failures per year. An error
correcting memory is typically slower than a non-correcting one, due to the
error correcting circuitry. Depending on the application, 0.7 failures per year
may or may not be viewed as an acceptable level of risk.

A modification of the parity code is the horizontal and vertical parity code,
which arranges the data in a 2-dimensional array and adds one parity bit to each
row and one parity bit to each column. Such a technique is useful for correcting
single-bit errors within a block of data words; however, it may fail to correct
multiple errors.
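
The way a horizontal and vertical parity code locates a single erroneous bit can be sketched as follows. The sketch makes several simplifying assumptions that are ours: even parity is used, the function names are hypothetical, the data block is a list of rows of bits, and only the data bits (not the stored parity bits) are assumed to be subject to errors.

```python
def compute_parity(block):
    """Return (row_parity, column_parity) of a 2-dimensional bit array."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols


def correct_single_error(block, row_parity, col_parity):
    """Check `block` against the stored parities and fix one flipped bit.

    Returns "ok" if no mismatch is found, "corrected" if exactly one row
    and one column mismatch (the bit at their intersection is flipped
    back in place), and "uncorrectable" otherwise.
    """
    rows, cols = compute_parity(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, col_parity)) if a != b]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        block[bad_rows[0]][bad_cols[0]] ^= 1
        return "corrected"
    return "uncorrectable"


if __name__ == "__main__":
    block = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
    row_parity, col_parity = compute_parity(block)
    block[1][2] ^= 1  # inject a single-bit error
    print(correct_single_error(block, row_parity, col_parity), block)
```

A single flipped bit changes the parity of exactly one row and one column, which pinpoints its position. Two flipped bits in the same row leave the row parity unchanged, so the error is noticed through the columns but cannot be located.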

4. Linear codes 

Linear codes provide a general framework for generating many codes, including
the Hamming code. The discussion of linear codes requires some
knowledge of linear algebra, which we first briefly review.

4.1 Basic notions 

Field $Z_2$. A field $Z_2$ is the set {0,1} together with two operations, addition
“+” and multiplication “·”, such that the following properties are satisfied for
all $a, b, c \in Z_2$.

1 $Z_2$ is closed under “·” and “+”, meaning that $a \cdot b \in Z_2$ and $a + b \in Z_2$.

2 $a + (b + c) = (a + b) + c$.

3 $a + b = b + a$.

4 There exists an element 0 in $Z_2$ such that $a + 0 = a$.

5 For each $a \in Z_2$, there exists an element $-a \in Z_2$ such that $a + (-a) = 0$.

6 $a \cdot (b \cdot c) = (a \cdot b) \cdot c$.

7 $a \cdot b = b \cdot a$.

8 There exists an element 1 in $Z_2$ such that $a \cdot 1 = 1 \cdot a = a$.

9 For each $a \in Z_2$ such that $a \neq 0$, there exists an element $a^{-1} \in Z_2$ such that
$a \cdot a^{-1} = 1$.

10 $a \cdot (b + c) = a \cdot b + a \cdot c$.

It is easy to see that the above properties are satisfied if “+” is defined as addition
modulo 2, “·” is defined as multiplication modulo 2, 0 = 0 and 1 = 1. Throughout
the chapter, we assume this definition of $Z_2$.
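
Since $Z_2$ has only two elements, the ten properties can be checked exhaustively. The following Python sketch does so for addition and multiplication modulo 2. The function names are ours, and the numbered comments refer to the properties listed above.

```python
from itertools import product

Z2 = (0, 1)


def add(a, b):
    """Addition modulo 2 (equivalent to XOR)."""
    return (a + b) % 2


def mul(a, b):
    """Multiplication modulo 2 (equivalent to AND)."""
    return (a * b) % 2


def check_field_axioms():
    """Verify the ten field properties for every a, b, c in Z2."""
    for a, b, c in product(Z2, repeat=3):
        assert add(a, b) in Z2 and mul(a, b) in Z2              # 1
        assert add(a, add(b, c)) == add(add(a, b), c)           # 2
        assert add(a, b) == add(b, a)                           # 3
        assert add(a, 0) == a                                   # 4
        assert any(add(a, x) == 0 for x in Z2)                  # 5
        assert mul(a, mul(b, c)) == mul(mul(a, b), c)           # 6
        assert mul(a, b) == mul(b, a)                           # 7
        assert mul(a, 1) == a == mul(1, a)                      # 8
        assert a == 0 or any(mul(a, x) == 1 for x in Z2)        # 9
        assert mul(a, add(b, c)) == add(mul(a, b), mul(a, c))   # 10
    return True


if __name__ == "__main__":
    print(check_field_axioms())
```

Note that in $Z_2$ every element is its own additive inverse, since 1 + 1 = 0.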

Vector space $V^n$. Let $Z_2^n$ denote the set of all n-tuples containing elements
from $Z_2$. For example, for n = 3, $Z_2^3$ = {000,001,010,011,100,101,110,111}.

A vector space $V^n$ over a field $Z_2$ is a subset of $Z_2^n$, with two operations, addition
“+” and multiplication “·”, such that the following axioms are satisfied for all
$x, y, z \in V^n$ and all $a, b, c \in Z_2$:

1 $V^n$ is closed under “+”, meaning that $x + y \in V^n$.

2 $x + y = y + x$.

3 $x + (y + z) = (x + y) + z$.

4 There exists an element 0 in $V^n$ such that $x + 0 = x$.

5 For each $x \in V^n$, there exists an element $-x \in V^n$ such that $x + (-x) = 0$.

6 $a \cdot x \in V^n$.

7 There exists an element $1 \in Z_2$ such that $1 \cdot x = x$.

8 $a \cdot (x + y) = (a \cdot x) + (a \cdot y)$.

9 $(a + b) \cdot x = a \cdot x + b \cdot x$.

10 $(a \cdot b) \cdot x = a \cdot (b \cdot x)$.

A subspace is a subset of a vector space that is itself a vector space.

A set of vectors $\{v_0, \dots, v_{k-1}\}$ is said to span a vector space $V^n$ if any $v \in V^n$
can be written as $v = a_0 v_0 + a_1 v_1 + \dots + a_{k-1} v_{k-1}$, where $a_0, \dots, a_{k-1} \in Z_2$.

A set of vectors $\{v_0, \dots, v_{k-1}\}$ of $V^n$ is said to be linearly independent if
$a_0 v_0 + a_1 v_1 + \dots + a_{k-1} v_{k-1} = 0$ implies that $a_0 = a_1 = \dots = a_{k-1} = 0$.

A basis is a set of vectors in a vector space $V^n$ that are linearly independent
and span $V^n$.

The dimension of a vector space is defined to be the number of vectors in its
basis.

4.2 Definition of linear code 

An (n,k) linear code over the field $Z_2$ is a k-dimensional subspace of $V^n$. In
other words, a linear code of length n is a subspace of $V^n$ which is spanned
by k linearly independent vectors. All codewords can be written as a linear
combination of the k basis vectors $\{v_0, \dots, v_{k-1}\}$ as follows:

$$c = d_0 v_0 + d_1 v_1 + \dots + d_{k-1} v_{k-1}$$

Since a different codeword is obtained for each different combination of coefficients
$d_0, d_1, \dots, d_{k-1}$, we obtain an easy method of encoding if we define
$d = (d_0, d_1, \dots, d_{k-1})$ as the data to be encoded.

Example 5.2. As an example, let us construct a (4,2) linear code. 

The data we are encoding are 2-bit words {[00], [01], [10], [11]}. These words
need to be encoded so that the resulting 4-bit codewords form a two-dimensional
subspace of $V^4$. To do this, we have to select two linearly independent vectors
as a basis of the two-dimensional subspace. One possibility is to choose the
vectors $v_0$ = [1000] and $v_1$ = [0110]. They are linearly independent since neither
is a multiple of the other.

To find the codeword, c, corresponding to the data word $d = [d_0 d_1]$, we
compute the linear combination of the two basis vectors $\{v_0, v_1\}$ as
$c = d_0 v_0 + d_1 v_1$. Thus, the data word d = [11] is encoded to

c = 1 · [1000] + 1 · [0110] = [1110]

Recall that “+” is defined as an XOR, so 1 + 1 = 0.

Similarly, [00] is encoded to

c = 0 · [1000] + 0 · [0110] = [0000]

[01] is encoded to

c = 0 · [1000] + 1 · [0110] = [0110]

and [10] is encoded to

c = 1 · [1000] + 0 · [0110] = [1000].


4.3 Generator matrix 

The computations we performed can be formalized by introducing the generator
matrix G, whose rows are the basis vectors $v_0$ through $v_{k-1}$. For instance,
the generator matrix for Example 5.2 is

$$G = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix} \qquad (5.1)$$

The codeword c is the product of the data word d and the generator matrix G:

$$c = dG$$

Note that in Example 5.2 the first two bits of each codeword are exactly
the same as the data bits, i.e. the code we have constructed is a separable code.
Separable linear codes are easy to decode by truncating the last $n - k$ bits.
In general, separability can be achieved by ensuring that the basis vectors form a
generator matrix of the form $[I_k A]$, where $I_k$ is an identity matrix of size $k \times k$.
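The product c = dG over $Z_2$ can be sketched in a few lines of Python. This is only an illustration: the function names `encode` and `decode_separable` are hypothetical, and vectors are represented as plain lists of 0/1 integers.

```python
def encode(d, G):
    """Return the codeword c = dG over Z_2; every addition is an XOR."""
    c = [0] * len(G[0])
    for bit, row in zip(d, G):
        if bit:  # add (XOR) the basis vector selected by this data bit
            c = [ci ^ ri for ci, ri in zip(c, row)]
    return c


def decode_separable(c, k):
    """For a generator matrix of the form [I_k A], the data are the first k bits."""
    return c[:k]


# Generator matrix (5.1) of the (4,2) code from Example 5.2.
G = [[1, 0, 0, 0],
     [0, 1, 1, 0]]

for d in ([0, 0], [0, 1], [1, 0], [1, 1]):
    c = encode(d, G)
    print(d, "->", c, "->", decode_separable(c, k=2))
```

Because each data bit simply selects whether the corresponding row of G is added, the encoder needs only XOR operations.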

Note also that the code distance of the code in Example 5.2 is one.
Therefore, such a code cannot detect errors. It is possible to predict the code
distance by examining the basis vectors. Consider the vector $v_0 = [1000]$ from
Example 5.2. Since it is a basis vector, it is a codeword. A code is a subspace
of $V^n$ and thus is itself a vector space. A vector space is closed under addition, thus
c + [1000] also belongs to the code, for any codeword c. The distance between
c and c + [1000] is 1, since they differ only in the first bit position. Therefore, a
code with a code distance $\delta$ should have basis vectors of weight greater than or
equal to $\delta$.
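The closure argument above implies that the code distance of a linear code equals the smallest weight of a non-zero codeword, since the distance between two codewords is the weight of their sum. The following sketch checks this by brute force; the helper names are hypothetical, and enumerating all $2^k$ codewords is only practical for small k.

```python
from itertools import product


def codewords(G):
    """Yield every codeword dG of the code generated by G (over Z_2)."""
    k, n = len(G), len(G[0])
    for d in product([0, 1], repeat=k):
        c = [0] * n
        for bit, row in zip(d, G):
            if bit:
                c = [a ^ b for a, b in zip(c, row)]
        yield c


def code_distance(G):
    """Minimum Hamming weight over all non-zero codewords."""
    return min(sum(c) for c in codewords(G) if any(c))


# The (4,2) code of Example 5.2 contains the weight-1 codeword [1000].
print(code_distance([[1, 0, 0, 0],
                     [0, 1, 1, 0]]))
```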

Example 5.3. Consider the (6,3) linear code spanned by the basis vectors
[100011], [010110] and [001101]. The generator matrix for this code is

$$G = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} \qquad (5.2)$$

For example, the data word d = [011] is encoded to

c = 0 · [100011] + 1 · [010110] + 1 · [001101] = [011011].

Recall that "+" is modulo 2, so 1 + 1 = 0. Similarly, we can encode other data
words. The resulting code is presented in Table 5.1. Each row of the table shows
a data word $d = [d_1 d_2 d_3]$ and the corresponding codeword $c = [c_1 c_2 c_3 c_4 c_5 c_6]$.

      data            codeword
    d1 d2 d3     c1 c2 c3 c4 c5 c6
     0  0  0      0  0  0  0  0  0
     0  0  1      0  0  1  1  0  1
     0  1  0      0  1  0  1  1  0
     0  1  1      0  1  1  0  1  1
     1  0  0      1  0  0  0  1  1
     1  0  1      1  0  1  1  1  0
     1  1  0      1  1  0  1  0  1
     1  1  1      1  1  1  0  0  0

Table 5.1. Defining table for a (6,3) linear code.
4.4 Parity check matrix

To detect errors in an (n,k) linear code, we use an $(n - k) \times n$ matrix H, called
the parity check matrix of the code. The parity check matrix represents the
parity of the codewords. The matrix H has the property that, for any codeword
c, $H \cdot c^T = 0$. By $c^T$ we denote the transpose of the vector c. Recall that the
transpose of an $n \times k$ matrix A is the $k \times n$ matrix $A^T$ obtained by defining the ith
column of $A^T$ to be the ith row of A.

The parity check matrix is related to the generator matrix by the equation

$$HG^T = 0$$

This equation implies that, if data d is encoded to a codeword dG using the
generator matrix G, then the product of the parity check matrix and the encoded
message is zero. This is true because

$$H(dG)^T = H(G^T d^T) = (HG^T)d^T = 0$$

If a generator matrix is of the form $G = [I_k A]$, then

$$H = [A^T I_{n-k}]$$

is a parity check matrix. This can be proved as follows:

$$HG^T = A^T I_k + I_{n-k} A^T = A^T + A^T = 0.$$


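Example 5.4. Let us construct the parity check matrix H for the generator matrix G
given by (5.2). G is of the form $[I_3 A]$ where

$$A = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$$

So, we have

$$H = [A^T I_3] = \begin{bmatrix} 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \qquad (5.3)$$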
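The construction $H = [A^T I_{n-k}]$ and the identity $HG^T = 0$ can be checked mechanically. The sketch below assumes G is already in the form $[I_k A]$; the function names are illustrative and not part of any library.

```python
def parity_check_from_generator(G):
    """Build H = [A^T I_(n-k)] from a generator matrix G = [I_k A]."""
    k, n = len(G), len(G[0])
    r = n - k
    A = [row[k:] for row in G]  # the k x (n-k) block to the right of I_k
    H = []
    for i in range(r):
        a_t_row = [A[j][i] for j in range(k)]            # row i of A^T
        identity_row = [1 if j == i else 0 for j in range(r)]
        H.append(a_t_row + identity_row)
    return H


def times_transpose(H, G):
    """Return the matrix H G^T over Z_2."""
    return [[sum(h * g for h, g in zip(h_row, g_row)) % 2 for g_row in G]
            for h_row in H]


# Generator matrix (5.2) of the (6,3) code.
G = [[1, 0, 0, 0, 1, 1],
     [0, 1, 0, 1, 1, 0],
     [0, 0, 1, 1, 0, 1]]
H = parity_check_from_generator(G)
print(H)
print(times_transpose(H, G))  # every entry should be 0
```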


4.5 Syndrome 

Encoded data can be checked for errors by multiplying it by the parity check
matrix:

$$s = Hc^T \qquad (5.4)$$

The resulting $(n - k)$-bit vector s is called the syndrome. If the syndrome is zero, no
error has occurred. If s matches one of the columns of H, then a single-bit error
has occurred. The bit position of the error corresponds to the position of the
matching column in H. For example, if the syndrome coincides with the second
column of H, the error is in the second bit of the codeword. If the syndrome is
not zero and is not equal to any of the columns of H, then a multiple-bit error
has occurred.


Example 5.5. As an example, consider the data d = [110] encoded using
the (6,3) linear code from Example 5.3 as c = dG = [110101]. Suppose
that an error occurs in the second bit of c, transforming it to [100101]. By
multiplying this word by the parity check matrix (5.3), we obtain the syndrome
s = [110]. The syndrome matches the second column of the parity check matrix
H, indicating that the error has occurred in the second bit.
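The syndrome check of Example 5.5 can be sketched as follows. The function names are hypothetical, bit positions are reported 0-based (so position 1 is the second bit), and the sketch assumes that the columns of H are distinct, as they are in (5.3).

```python
def syndrome(H, word):
    """Compute s = H * word^T over Z_2."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


def correct_single_error(H, word):
    """Return (corrected word, error position or None)."""
    s = syndrome(H, word)
    if not any(s):
        return list(word), None          # zero syndrome: no error detected
    columns = [list(col) for col in zip(*H)]
    if s in columns:
        pos = columns.index(s)           # matching column gives the faulty bit
        fixed = list(word)
        fixed[pos] ^= 1
        return fixed, pos
    raise ValueError("syndrome matches no column: multiple-bit error")


# Parity check matrix (5.3).
H = [[0, 1, 1, 1, 0, 0],
     [1, 1, 0, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]

received = [1, 0, 0, 1, 0, 1]            # [110101] with its second bit flipped
print(syndrome(H, received))
print(correct_single_error(H, received))
```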

4.6 Constructing linear codes 

As we showed in Section 2.6, for a code to be able to correct s errors, its code
distance should be at least 2s + 1. It is possible to ensure a given code distance
by carefully selecting the parity check matrix and then by using it to construct
the corresponding generator matrix. It can be proved that a code has a distance of at
least $\delta$ if and only if every subset of $\delta - 1$ columns of its parity check matrix
H is linearly independent. So, to have a code distance of two, we must ensure
that every column of the parity check matrix is linearly independent. This is
equivalent to the requirement of not having a zero column, since the zero vector
can never be a member of a set of linearly independent vectors.
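Over $Z_2$, a set of columns is linearly dependent exactly when some non-empty subset of it sums to zero, so the criterion above says that the code distance is the size of the smallest set of columns of H that XOR to zero. The brute-force sketch below uses this reading; the function name is hypothetical, the search is exponential in n, and the second matrix is just an illustrative matrix with a zero column.

```python
from itertools import combinations


def distance_from_parity_check(H):
    """Size of the smallest non-empty set of columns of H that sums to zero."""
    columns = list(zip(*H))
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            # zip(*subset) walks over the rows of the chosen columns
            if all(sum(bits) % 2 == 0 for bits in zip(*subset)):
                return size
    return None  # all columns are linearly independent


# Parity check matrix (5.3) of the (6,3) code.
print(distance_from_parity_check([[0, 1, 1, 1, 0, 0],
                                  [1, 1, 0, 0, 1, 0],
                                  [1, 0, 1, 0, 0, 1]]))

# A zero column alone is already a dependent set, so the distance is one.
print(distance_from_parity_check([[0, 1, 1, 0],
                                  [0, 0, 0, 1]]))
```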


Example 5.6. In the parity check matrix of the code with generator matrix (5.1),
which we constructed in Example 5.2, the first column is zero:

$$H = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Therefore, the columns of H are linearly dependent and the code distance is one.

Let us modify H to construct a new code with a code distance of at least
two. Suppose we replace the zero column by the column containing 1 in all its
entries:

$$H = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$

So, now A is

$$A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$$

and therefore G can be constructed as

$$G = [I_2 A] = \begin{bmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}$$
Using this generator matrix, the data words are encoded as dG, resulting in the
code shown in Table 5.2.

     data       codeword
    d1 d2     c1 c2 c3 c4
     0  0      0  0  0  0
     0  1      0  1  1  0
     1  0      1  0  1  1
     1  1      1  1  0  1

Table 5.2. Defining table for a (4,2) linear code.

The code distance of the resulting (4,2) code is two. So, this code could be
used to detect single-bit errors.

Example 5.7. Let us construct a code with a minimum code distance of three,
capable of correcting single-bit errors. We apply an approach similar to the one
in Example 5.6. First, we create a parity check matrix in the form $[A^T I_{n-k}]$, such
that every pair of its columns is linearly independent. This can be achieved by
ensuring that each column is non-zero and no column is repeated.

For example, if our goal is to construct a (3,1) code, then the following
matrix

$$H = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$$

has all its columns non-zero and no column is repeated. The matrix $A^T$
in this case is

$$A^T = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

So, A = [11] and therefore G is

$$G = [I_1 A] = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}.$$

The resulting (3,1) code consists of two codewords, 000 and 111.


4.7 Hamming codes 

Hamming codes are a family of linear codes. They are named after Richard
W. Hamming, who developed the first single-error correcting Hamming code
and its extended version, the single-error correcting double-error detecting
Hamming code, in the early 1950's. These codes remain important today.

Consider the following parity check matrix, corresponding to a (7,4) Hamming
code:

$$H = \begin{bmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad (5.5)$$

H has n = 7 columns of length $n - k = 3$. Note that $7 = 2^{n-k} - 1$, so the
columns of H represent all possible non-zero vectors of length 3. In general, the
parity check matrix of an (n,k) Hamming code is constructed as follows. For a given
$n - k$, construct a binary $(n - k) \times (2^{n-k} - 1)$ matrix H such that each non-zero
binary $(n - k)$-tuple occurs exactly once as a column of H. Any code with such a
check matrix is called a binary Hamming code. The code distance of any binary
Hamming code is 3, so a Hamming code is a single-error correcting code.

If the columns of H are permuted, the resulting code remains a Hamming
code, since the new check matrix is still a set of all possible non-zero $(n - k)$-tuples.
Different parity check matrices can be selected to suit different purposes. For
example, by permuting the columns of the matrix (5.5), we can get the following
matrix:

$$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix} \qquad (5.6)$$

This matrix is a parity check matrix for a different (7,4) Hamming code.
Note that its column i contains a binary representation of the integer
$i \in \{1, 2, \ldots, 2^{n-k} - 1\}$. A check matrix satisfying this property is called a
lexicographic parity check matrix. The code corresponding to the matrix (5.5) has
a generator matrix in standard form $G = [I_k A]$. The code corresponding to the
matrix (5.6) does not have a generator matrix in standard form.

For a Hamming code with a lexicographic parity check matrix, a simple procedure
for syndrome decoding can be applied, similar to the one discussed earlier
in Section 4.5. To check a codeword x for errors, we first calculate the syndrome
$s = Hx^T$. If s is zero, then no error has occurred. If s is not zero, then
it is a binary representation of some integer $i \in \{1, 2, \ldots, 2^{n-k} - 1\}$. Then, x is
decoded assuming that a single error has occurred in the ith bit of x.
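The decoding procedure for a lexicographic parity check matrix can be sketched as below. The function names are hypothetical; the first row of H is taken as the most significant bit of the column index, which matches the layout of (5.6).

```python
def lexicographic_parity_check(r):
    """r x (2^r - 1) matrix whose column i (1-based) is the binary form of i."""
    n = 2 ** r - 1
    return [[(i >> (r - 1 - row)) & 1 for i in range(1, n + 1)]
            for row in range(r)]


def decode(word, r=3):
    """Return (corrected word, i), where i is the 1-based faulty bit or 0."""
    H = lexicographic_parity_check(r)
    s = [sum(h * w for h, w in zip(row, word)) % 2 for row in H]
    i = int("".join(map(str, s)), 2)     # read the syndrome as an integer
    fixed = list(word)
    if i:
        fixed[i - 1] ^= 1                # complement the ith bit
    return fixed, i


x = [1, 1, 1, 0, 0, 0, 0]                # columns 1, 2, 3 of (5.6) sum to zero
received = list(x)
received[4] ^= 1                         # inject an error into the 5th bit
print(decode(received))
```

No table lookup is needed here: the syndrome itself is the address of the faulty bit.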


Example 5.8. Let us construct a (7,4) Hamming code corresponding to the
parity check matrix (5.5). H is in the form $[A^T I_3]$ where

$$A^T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}$$

The generator matrix is of the form $G = [I_4 A]$, or

$$G = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 \end{bmatrix}$$

Suppose the data to be encoded is d = [1110]. We multiply d by G to get the
codeword c = [1110001]. Suppose that an error occurs in the last bit of c,
transforming it to [1110000]. Before decoding this word, we first check it for errors
by multiplying it by the parity check matrix (5.5). The resulting syndrome
s = [001] matches the last column of H, indicating that the error has occurred
in the last bit. So, we correct [1110000] to [1110001] and then decode it by
taking the first four bits as data d = [1110].


Example 5.9. The generator matrix corresponding to the lexicographic parity
check matrix (5.6) is given by:

$$G = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

So, the data $d = [d_0 d_1 d_2 d_3]$ is encoded as $[p_3 p_2 d_2 p_1 d_1 d_0 d_3]$ where $p_1, p_2, p_3$
are parity check bits defined by $p_1 = d_0 + d_1 + d_3$, $p_2 = d_0 + d_2 + d_3$ and
$p_3 = d_1 + d_2 + d_3$. The addition is modulo 2.


The information rate of a (7,4) Hamming code is k/n = 4/7. In general the
rate of an (n,k) Hamming code is $k/(2^{n-k} - 1)$.

Hamming codes are widely used for DRAM error-correction. Encoding
is usually performed on complete words, rather than individual bytes. As in
the parity code case, when a word is written into a memory, the check bits
are computed by a check bits generator. For instance, for the (7,4) Hamming
code from Example 5.9, the check bits are computed as $p_1 = d_0 + d_1 + d_3$,
$p_2 = d_0 + d_2 + d_3$ and $p_3 = d_1 + d_2 + d_3$, using a tree of XOR gates.

When the word is read back, check bits are recomputed and the syndrome
is generated by taking an XOR of the read and recomputed check bits. If
the syndrome is zero, no error has occurred. If the syndrome is non-zero, it
is used to locate the faulty bit by comparing it to the columns of H. This
can be implemented either in hardware or in software. If an (n,k) Hamming
code with a lexicographic parity check matrix is used, then the error correction
can be implemented using a decoder and XOR gates. If the syndrome s = i,
$i \in \{1, 2, \ldots, 2^{n-k} - 1\}$, the ith bit of the word is faulty.
Figure 5.6. Error-correcting circuit for a (7,4) Hamming code with a lexicographic parity
check matrix.

An example of error correction for the (7,4) Hamming code from Example 5.8
is shown in Figure 5.6. The first level of XOR gates compares the read check bits $p_r$
with the recomputed ones p. The result of this comparison is the syndrome $[s_0 s_1 s_2]$,
which is fed into the decoder. For the syndrome s = i, $i \in \{0, 1, \ldots, 7\}$, the ith
output of the decoder is high. The second level of XOR gates complements the
ith bit of the word, thus correcting the error.

Often, the extended Hamming code rather than the regular Hamming code
is used, which allows not only single-bit error correction, but also
double-bit error detection. We describe this code in the next section.

4.8 Extended Hamming codes 

The code distance of a Hamming code is three. If we add a parity check bit
to every codeword of a Hamming code, then the code distance increases to four.
The resulting code is called the extended Hamming code. It can correct single-bit
errors and detect double-bit errors.

The parity check matrix for an extended (n,k) Hamming code can be obtained
by first adding a zero column in front of a lexicographic parity check matrix of
an (n,k) Hamming code, and then by attaching a row consisting of all 1's as
the $(n - k + 1)$th row of the resulting matrix. For example, the matrix H for an
extended (1,0) Hamming code is given by

$$H = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

The matrix H for an extended (3,1) Hamming code is given by

$$H = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$

The matrix H for an extended (7,4) Hamming code is given by

$$H = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

If $c = [c_1, c_2, \ldots, c_n]$ is a codeword from an (n,k) Hamming code, then
$c' = [c_0, c_1, c_2, \ldots, c_n]$ is the corresponding extended codeword, where $c_0 = \sum_{i=1}^{n} c_i$
is the parity bit.
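One common way to use the extended parity check matrix is sketched below: the last syndrome bit (from the all-ones row) tells whether the number of flipped bits is odd, and the remaining bits give the position. The function names and the returned labels are hypothetical, positions are 0-based with $c_0$ at position 0, and at most two bit errors are assumed.

```python
def extend(codeword):
    """Prepend the overall parity bit c_0 = c_1 + ... + c_n (mod 2)."""
    return [sum(codeword) % 2] + list(codeword)


def extended_parity_check(r):
    """Zero column + lexicographic matrix, with an all-ones row appended."""
    n = 2 ** r - 1
    rows = [[0] + [(i >> (r - 1 - row)) & 1 for i in range(1, n + 1)]
            for row in range(r)]
    rows.append([1] * (n + 1))
    return rows


def classify(word, r=3):
    """Classify a received extended word, assuming at most two bit errors."""
    H = extended_parity_check(r)
    s = [sum(h * w for h, w in zip(row, word)) % 2 for row in H]
    if not any(s):
        return "no error", None
    if s[-1] == 1:                       # odd overall parity: single error
        return "single error", int("".join(map(str, s[:-1])), 2)
    return "double error", None          # even parity, non-zero syndrome


c = extend([1, 1, 1, 0, 0, 0, 0])        # a codeword of the code defined by (5.6)
received = list(c)
received[2] ^= 1
received[5] ^= 1
print(classify(c), classify(received))
```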

5. Cyclic codes 

Cyclic codes are a special class of linear codes. Cyclic codes are used in
applications where burst errors can occur, in which a group of adjacent bits is
affected. Such errors are typical in digital communication as well as in storage
devices, such as disks and tapes. A scratch on a compact disc is one example
of a burst error. Two important classes of cyclic codes which we will consider
are cyclic redundancy check (CRC) codes, used in modems and network
protocols, and Reed-Solomon codes, applied in satellite communication, wireless
communication, compact disc players and DVDs.

5.1 Definition

A linear code is called cyclic if $[c_{n-1} c_0 c_1 c_2 \ldots c_{n-2}]$ is a codeword whenever
$[c_0 c_1 c_2 \ldots c_{n-2} c_{n-1}]$ is also a codeword. So, any end-around shift of a codeword
of a cyclic code produces another codeword.

When working with cyclic codes, it is convenient to think of words as
polynomials rather than vectors. For a binary cyclic code, the coefficients of the
polynomials are 0 or 1. For example, a data word $[d_0 d_1 d_2 \ldots d_{k-1}]$ is
represented as a polynomial

$$d_0 \cdot x^0 + d_1 \cdot x^1 + d_2 \cdot x^2 + \ldots + d_{k-1} \cdot x^{k-1}$$

where addition and multiplication are in the field $Z_2$, i.e. modulo 2. The degree
of a polynomial is equal to its highest exponent. For example, the word [1011]
corresponds to the polynomial $1 \cdot x^0 + 0 \cdot x^1 + 1 \cdot x^2 + 1 \cdot x^3 = 1 + x^2 + x^3$ (least
significant bit on the left). The degree of this polynomial is 3.

Before continuing with cyclic codes, we first review the basics of polynomial
arithmetic, necessary for the understanding of encoding and decoding
algorithms.
5.2 Polynomial manipulation

In this section we consider examples of polynomial multiplication and division.
All the operations are carried out in the field $Z_2$, i.e. modulo 2.

Example 5.10. Compute $(1 + x + x^3) \cdot (1 + x^2)$.

$$(1 + x + x^3) \cdot (1 + x^2) = 1 + x + x^3 + x^2 + x^3 + x^5 = 1 + x + x^2 + x^5$$

Note that $x^3 + x^3 = 0$, since addition is modulo 2.
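Multiplication modulo 2 can be sketched by storing a polynomial as a Python integer whose bit i is the coefficient of $x^i$. This representation and the names `poly_mul` and `poly_str` are assumptions made for illustration; the operands $1 + x + x^3$ and $1 + x^2$ are sample values.

```python
def poly_mul(a, b):
    """Multiply two polynomials over Z_2 (carry-less multiplication)."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # add the current shifted copy of a; XOR is addition mod 2
        a <<= 1           # multiply a by x
        b >>= 1
    return result


def poly_str(p):
    """Render a polynomial, lowest degree first."""
    terms = []
    for i in range(p.bit_length()):
        if (p >> i) & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms) or "0"


a = 0b1011                # 1 + x + x^3
b = 0b0101                # 1 + x^2
print(poly_str(poly_mul(a, b)))
```

Equal terms cancel automatically because adding a term twice XORs the same bit twice.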

Example 5.11. Compute $(1 + x^3 + x^5 + x^6)/(1 + x + x^3)$.

                   x^3 + x^2 + x + 1
                 ---------------------
    x^3 + x + 1 | x^6 + x^5 + x^3 + 1
                  x^6 + x^4 + x^3
                  -------------------
                  x^5 + x^4 + 1
                  x^5 + x^3 + x^2
                  -------------------
                  x^4 + x^3 + x^2 + 1
                  x^4 + x^2 + x
                  -------------------
                  x^3 + x + 1
                  x^3 + x + 1
                  -------------------
                  0

So, $1 + x + x^3$ divides $1 + x^3 + x^5 + x^6$ without a remainder and the result is
$1 + x + x^2 + x^3$.

For the decoding algorithm we also need to know how to perform arithmetic
modulo p(x), where p(x) is a polynomial. To find f(x) mod p(x), we divide
f(x) by p(x) and take the remainder.

Example 5.12. Compute $(1 + x^2 + x^5) \bmod (1 + x + x^3)$.

                   x^2 + 1
                 ---------------
    x^3 + x + 1 | x^5 + x^2 + 1
                  x^5 + x^3 + x^2
                  ---------------
                  x^3 + 1
                  x^3 + x + 1
                  ---------------
                  x

So, $(1 + x^2 + x^5) \bmod (1 + x + x^3) = x$.
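Long division modulo 2 can be sketched in the same style, again storing a polynomial as an integer whose bit i is the coefficient of $x^i$ (an assumption of this sketch, as is the name `poly_divmod`). The first call repeats the division $(1 + x^3 + x^5 + x^6)/(1 + x + x^3)$; the dividend $1 + x^2 + x^5$ in the second call is an illustrative choice.

```python
def poly_divmod(a, b):
    """Return (quotient, remainder) of a(x) / b(x) over Z_2."""
    if b == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient = 0
    deg_b = b.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_b:
        shift = a.bit_length() - 1 - deg_b
        quotient ^= 1 << shift    # record the term x^shift of the quotient
        a ^= b << shift           # subtract (= add) x^shift * b(x)
    return quotient, a


g = 0b1011                                  # 1 + x + x^3
print(poly_divmod(0b1101001, g))            # 1 + x^3 + x^5 + x^6
print(poly_divmod(0b100101, g))             # 1 + x^2 + x^5
```

The remainder returned by `poly_divmod` is exactly f(x) mod p(x).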


5.3 Generator polynomial 

To encode data in a cyclic code, the polynomial representing the data is
multiplied by a polynomial known as the generator polynomial. The generator
polynomial determines the properties of the resulting cyclic code. For example,
suppose we encode the data [1011] using the generator polynomial
$g(x) = 1 + x + x^3$ (least significant bit on the left). The polynomial representing the
data is $d(x) = 1 + x^2 + x^3$. By computing $g(x) \cdot d(x)$, we get
$1 + x + x^2 + x^3 + x^4 + x^5 + x^6$. So, the codeword corresponding to the data [1011] is [1111111].

The choice of the generator polynomial is guided by the property that g(x)
is the generator polynomial for a linear cyclic code of length n if and only if
g(x) divides $1 + x^n$ without a remainder.

If n is the length of the codeword, then the length of the encoded data word is
$k = n - deg(g(x))$, where deg(g(x)) denotes the degree of the generator
polynomial g(x). A cyclic code with a generator polynomial of degree $n - k$ is called an
(n,k) cyclic code. An (n,k) cyclic code can detect burst errors affecting $n - k$
bits or less.


Example 5.13. Find a generator polynomial for a code of length n = 7 for
encoding data of length k = 4.

We are looking for a polynomial of degree 7 − 4 = 3 which divides $1 + x^7$
without a remainder. The polynomial $1 + x^7$ can be factored as
$1 + x^7 = (1 + x + x^3)(1 + x^2 + x^3)(1 + x)$, so we can choose either $g(x) = 1 + x + x^3$ or
$g(x) = 1 + x^2 + x^3$. Table 5.3 shows the cyclic code generated by $g(x) = 1 + x + x^3$.
Since deg(g(x)) = 3, burst errors of up to 3 bits can be detected by this code.
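The search in Example 5.13 can be reproduced by trying every polynomial of degree $n - k$ and keeping those that divide $1 + x^n$. The sketch below stores polynomials as integers with bit i holding the coefficient of $x^i$; the names `poly_mod`, `poly_mul` and `generator_candidates` are hypothetical.

```python
def poly_mod(a, b):
    """Remainder of a(x) / b(x) over Z_2."""
    deg_b = b.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_b:
        a ^= b << (a.bit_length() - 1 - deg_b)
    return a


def poly_mul(a, b):
    """Product of a(x) and b(x) over Z_2."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def generator_candidates(n, k):
    """All polynomials of degree n - k that divide 1 + x^n."""
    r = n - k
    target = (1 << n) | 1                       # 1 + x^n
    return [g for g in range(1 << r, 1 << (r + 1)) if poly_mod(target, g) == 0]


print([bin(g) for g in generator_candidates(7, 4)])

# Encoding the data [1011], i.e. d(x) = 1 + x^2 + x^3, with g(x) = 1 + x + x^3.
print(bin(poly_mul(0b1011, 0b1101)))
```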

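Let C be an (n,k) cyclic code generated by the generator polynomial g(x).
The codewords $x^i g(x)$ are a basis for C, since every codeword

$$d(x)g(x) = d_0 g(x) + d_1 x g(x) + \ldots + d_{k-1} x^{k-1} g(x)$$

is a linear combination of $x^i g(x)$. So, the following matrix G with rows $x^i g(x)$
is a generator matrix for C:

$$G = \begin{bmatrix} g(x) \\ x g(x) \\ \vdots \\ x^{k-2} g(x) \\ x^{k-1} g(x) \end{bmatrix} = \begin{bmatrix} g_0 & g_1 & \cdots & g_{n-k} & 0 & 0 & \cdots & 0 \\ 0 & g_0 & g_1 & \cdots & g_{n-k} & 0 & \cdots & 0 \\ \vdots & & & & & & & \vdots \\ 0 & \cdots & 0 & g_0 & g_1 & \cdots & g_{n-k} & 0 \\ 0 & \cdots & 0 & 0 & g_0 & g_1 & \cdots & g_{n-k} \end{bmatrix}$$

Every row of G is a right cyclic shift of the first row. This generator matrix
leads to a simple encoding algorithm using polynomial multiplication by g(x).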


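Example 5.14. If C is a binary cyclic code with the generator polynomial
$g(x) = 1 + x + x^3$, then the generator matrix is given by:

$$G = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix}$$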
        data                codeword
    d0 d1 d2 d3     c0 c1 c2 c3 c4 c5 c6
     0  0  0  0      0  0  0  0  0  0  0
     0  0  0  1      0  0  0  1  1  0  1
     0  0  1  0      0  0  1  1  0  1  0
     0  0  1  1      0  0  1  0  1  1  1
     0  1  0  0      0  1  1  0  1  0  0
     0  1  0  1      0  1  1  1  0  0  1
     0  1  1  0      0  1  0  1  1  1  0
     0  1  1  1      0  1  0  0  0  1  1
     1  0  0  0      1  1  0  1  0  0  0
     1  0  0  1      1  1  0  0  1  0  1
     1  0  1  0      1  1  1  0  0  1  0
     1  0  1  1      1  1  1  1  1  1  1
     1  1  0  0      1  0  1  1  1  0  0
     1  1  0  1      1  0  1  0  0  0  1
     1  1  1  0      1  0  0  0  1  1  0
     1  1  1  1      1  0  0  1  0  1  1

Table 5.3. Defining table for the (7,4) cyclic code with the generator polynomial $g(x) = 1 + x + x^3$.


5.4 Parity check polynomial 

Given a cyclic code C with the generator polynomial g(x), the polynomial
h(x) determined by

$$g(x)h(x) = 1 + x^n$$

is the check polynomial of C. Since the codewords of a cyclic code are multiples
of g(x), for every codeword $c(x) \in C$, it holds that

$$c(x)h(x) = d(x)g(x)h(x) = d(x)(1 + x^n) = 0 \bmod (1 + x^n).$$

A parity check matrix H contains as its first row the coefficients of h(x),
starting from the most significant one. Every following row of H is a right
cyclic shift of the first row.

$$H = \begin{bmatrix} h_k & h_{k-1} & \cdots & h_0 & 0 & 0 & \cdots & 0 \\ 0 & h_k & h_{k-1} & \cdots & h_0 & 0 & \cdots & 0 \\ \vdots & & & & & & & \vdots \\ 0 & \cdots & 0 & h_k & h_{k-1} & \cdots & h_0 & 0 \\ 0 & \cdots & 0 & 0 & h_k & h_{k-1} & \cdots & h_0 \end{bmatrix}$$
Example 5.15. Suppose C is a binary cyclic code with the generator polynomial
$g(x) = 1 + x + x^3$. Let us compute its check polynomial.

There are three factors of $1 + x^7$: $1 + x$, $1 + x + x^3$ and $1 + x^2 + x^3$. Thus,
$h(x) = (1 + x^2 + x^3)(1 + x) = 1 + x + x^2 + x^4$. The corresponding parity check
matrix is given by

$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{bmatrix} \qquad (5.7)$$
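The check polynomial and the parity check matrix (5.7) can be derived mechanically from g(x). In the sketch below polynomials are stored as integers with bit i holding the coefficient of $x^i$; the function names are hypothetical.

```python
def poly_divmod(a, b):
    """Return (quotient, remainder) of a(x) / b(x) over Z_2."""
    quotient = 0
    deg_b = b.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_b:
        shift = a.bit_length() - 1 - deg_b
        quotient ^= 1 << shift
        a ^= b << shift
    return quotient, a


def check_polynomial(g, n):
    """h(x) such that g(x) h(x) = 1 + x^n."""
    h, remainder = poly_divmod((1 << n) | 1, g)
    if remainder:
        raise ValueError("g(x) does not divide 1 + x^n")
    return h


def parity_check_matrix(h, n):
    """Rows are right shifts of [h_k, h_(k-1), ..., h_0, 0, ..., 0]."""
    k = h.bit_length() - 1
    coefficients = [(h >> (k - j)) & 1 for j in range(k + 1)]  # h_k first
    rows = n - k
    return [[0] * i + coefficients + [0] * (rows - 1 - i) for i in range(rows)]


h = check_polynomial(0b1011, 7)          # g(x) = 1 + x + x^3
print(bin(h))
for row in parity_check_matrix(h, 7):
    print(row)
```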


5.5 Syndrome polynomial 

Since cyclic codes are linear codes, we can use the parity check polynomial
to detect errors which might have occurred in a codeword c(x) during
data transmission or storage. We define

$$s(x) = h(x)c(x) \bmod (1 + x^n)$$

to be the syndrome polynomial. Since $g(x)h(x) = 1 + x^n$, the syndrome can be
computed by dividing c(x) by g(x). If the remainder s(x) is zero, then c(x)
is a codeword. Otherwise, there is an error in c(x).

5.6 Implementation of polynomial division 

The polynomial division can be implemented by linear feedback shift
registers (LFSR). The logic diagram of an LFSR for the generator polynomial of
degree $r = n - k$ is shown in Figure 5.7. It consists of a simple shift register
and binary-weighted modulo 2 sums with feedback connections. Weights $g_i$,
$i \in \{0, 1, \ldots, r - 1\}$, are the coefficients of the generator polynomial
$g(x) = g_0 x^0 + g_1 x^1 + \ldots + g_r x^r$. Each $g_i$ is either 0, meaning "no connection", or 1,
meaning "connection". An exception is $g_r$ which is always 1 and therefore is
always connected.

If the input polynomial is c(x), then the LFSR divides c(x) by the generator
polynomial g(x), resulting in the quotient d(x) and the remainder s(x). The
coefficients $[s_0 s_1 \ldots s_{r-1}]$ of the syndrome polynomial are contained in the register
after the division is completed. If the syndrome is zero, then c(x) is a codeword
and d(x) is valid data. If $[s_0 s_1 \ldots s_{r-1}]$ matches one of the columns of the parity check
matrix H, then a single-bit error has occurred. The bit position of the error
corresponds to the position of the matching column in H, so the error can be
corrected. If the syndrome is not zero and is not equal to any of the columns of
H, then a multiple-bit error has occurred which cannot be corrected.

Figure 5.7. Logic diagram of a Linear Feedback Shift Register (LFSR).

Example 5.16. As an example, consider the logic circuit shown in Figure 5.8.
It implements an LFSR for the generator polynomial $g(x) = 1 + x + x^3$. Let $s_i^+$
denote the next state value of the register cell $s_i$. Then, the state values of the
LFSR in Figure 5.8 are given by

$$s_0^+ = s_2 + c(x)$$
$$s_1^+ = s_0 + s_2$$
$$s_2^+ = s_1$$

Figure 5.8. Implementation of the decoding circuit for a cyclic code with generator polynomial
$g(x) = 1 + x + x^3$.

Suppose the word to be decoded is [1010001], i.e. $1 + x^2 + x^6$. Table 5.4
shows the values of the register. This word is fed into the LFSR with the most
significant bit first. The first bit of the quotient (the most significant one) appears
at the output at the 4th clock cycle. In general, the first bit of the quotient comes
out at the clock cycle r + 1 for an LFSR of size $r = n - k$. After the division is
completed at cycle 7 (cycle n for the general case), the state of the register is
[000], so [1010001] is a codeword and the quotient [1101] is valid data. We can
verify the obtained result by dividing $1 + x^2 + x^6$ by $1 + x + x^3$. The quotient is
$1 + x + x^3$, which is indeed [1101].

Example 5.17. Next, suppose that a single-bit error has occurred in the 4th
bit of the codeword [1010001], and the word [1011001] is received instead. Table 5.5
illustrates the decoding process. As we can see, after the division is completed,
the registers contain the remainder [110], which matches the 4th column of the
parity check matrix H in (5.7).
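The behaviour of the three-cell LFSR can be simulated directly from its next-state equations $s_0^+ = s_2 + c(x)$, $s_1^+ = s_0 + s_2$, $s_2^+ = s_1$. In this sketch the function name `lfsr_divide` is hypothetical, the output at each clock is taken to be the value of $s_2$ before the shift, and words are lists $[c_0, \ldots, c_6]$ fed in most significant bit first.

```python
def lfsr_divide(word):
    """Divide word(x) by g(x) = 1 + x + x^3; return (quotient, register)."""
    s0 = s1 = s2 = 0
    outputs = []
    for bit in reversed(word):          # most significant bit enters first
        outputs.append(s2)              # output is the bit fed back this cycle
        s0, s1, s2 = s2 ^ bit, s0 ^ s2, s1
    r = 3                               # register size, r = n - k
    quotient = outputs[r:][::-1]        # quotient bits appear from cycle r + 1
    return quotient, [s0, s1, s2]


print(lfsr_divide([1, 0, 1, 0, 0, 0, 1]))   # error-free word
print(lfsr_divide([1, 0, 1, 1, 0, 0, 1]))   # 4th bit flipped
```

A zero register after the last clock cycle means the input was a codeword; otherwise the register holds the syndrome.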
    clock     input     register state     output
    period    c(x)       s0  s1  s2         d(x)
      0                   0   0   0
      1         1         1   0   0          0
      2         0         0   1   0          0
      3         0         0   0   1          0
      4         0         1   1   0          1
      5         1         1   1   1          0
      6         0         1   0   1          1
      7         1         0   0   0          1

Table 5.4. Register values of the circuit in Figure 5.8 for the input [1010001].

    clock     input     register state     output
    period    c(x)       s0  s1  s2         d(x)
      0                   0   0   0
      1         1         1   0   0          0
      2         0         0   1   0          0
      3         0         0   0   1          0
      4         1         0   1   0          1
      5         1         1   0   1          0
      6         0         1   0   0          1
      7         1         1   1   0          0

Table 5.5. Register values of the circuit in Figure 5.8 for the input [1011001].


5.7 Separable cyclic codes 

The cyclic codes that we have studied so far were not separable. It is possible
to construct a separable cyclic code by applying the following technique.

First, we take the data $d = [d_0 d_1 \ldots d_{k-1}]$ to be encoded and shift it right by
$n - k$ positions:

$$[0, 0, \ldots, 0, d_0, d_1, \ldots, d_{k-1}]$$

Shifting the vector d right by $n - k$ positions corresponds to multiplying the
data polynomial d(x) by the term $x^{n-k}$:

$$d(x)x^{n-k} = d_0 x^{n-k} + d_1 x^{n-k+1} + \ldots + d_{k-1} x^{n-1}$$

Next, we employ the division algorithm to write

$$d(x)x^{n-k} = q(x)g(x) + r(x)$$

where q(x) is the quotient and r(x) is the remainder of the division of $d(x)x^{n-k}$ by the
generator polynomial g(x). The remainder r(x) has degree less than $n - k$, i.e. it
is of type

$$[r_0, r_1, \ldots, r_{n-k-1}, 0, 0, \ldots, 0]$$

By moving r(x) from the right hand side of the equation to the left hand side of
the equation we get:

$$d(x)x^{n-k} + r(x) = q(x)g(x)$$

Recall that "−" is equivalent to "+" in $Z_2$. Since the left hand side of this
equation is equal to a multiple of g(x), it is a codeword. This codeword has the
form

$$[r_0, r_1, \ldots, r_{n-k-1}, d_0, d_1, \ldots, d_{k-1}]$$

So, we have obtained a codeword in which the data is separated from the check
bits.

Example 5.18. Let us demonstrate a systematic encoding for a (7,4) code
with the generator polynomial $g(x) = 1 + x + x^3$. Let $d(x) = x + x^3$, i.e. [0101].

First, we compute $x^{n-k} d(x) = x^3 (x + x^3) = x^4 + x^6$. Next, we employ the
division algorithm:

$$x^6 + x^4 = (1 + x^3)(1 + x + x^3) + (x + 1)$$

So, the resulting codeword is

$$c(x) = d(x)x^{n-k} + r(x) = 1 + x + x^4 + x^6$$

i.e. [1100101]. We can easily separate the data part of the codeword: it is
contained in the last four bits.
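The systematic encoding steps (shift, divide, append the remainder) can be sketched as follows. Polynomials are stored as integers with bit i holding the coefficient of $x^i$, words are lists with the least significant bit on the left, and the function names are hypothetical.

```python
def poly_mod(a, b):
    """Remainder of a(x) / b(x) over Z_2."""
    deg_b = b.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_b:
        a ^= b << (a.bit_length() - 1 - deg_b)
    return a


def encode_systematic(data_bits, g, n):
    """Return [r_0 ... r_(n-k-1) d_0 ... d_(k-1)] for data [d_0 ... d_(k-1)]."""
    k = len(data_bits)
    d = sum(bit << i for i, bit in enumerate(data_bits))
    shifted = d << (n - k)                    # d(x) * x^(n-k)
    c = shifted ^ poly_mod(shifted, g)        # add the remainder r(x)
    return [(c >> i) & 1 for i in range(n)]


codeword = encode_systematic([0, 1, 0, 1], g=0b1011, n=7)
print(codeword)            # check bits first, data in the last four positions
print(codeword[3:])        # recovered data
```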


Example 5.19. Suppose C is a binary separable cyclic code with the generator
polynomial $g(x) = 1 + x + x^3$. Compute its generator and parity check matrices.

Consider the parity check matrix (5.7). To obtain a separable code, we need
to permute its columns to bring it to the form $H = [A^T I_{n-k}]$. One of the solutions
is:

$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$

The corresponding generator matrix $G = [I_k A]$ is:

$$G = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix}$$
Since the encoding of a separable cyclic code involves division, it can be
implemented using an LFSR identical to the one used for decoding. The
multiplication by $x^{n-k}$ is done by shifting the data right by $n - k$ positions. After
the last bit of d(x) has been fed in, the LFSR contains the remainder of the division
of the input polynomial by g(x). By subtracting this remainder from $x^{n-k} d(x)$ we
obtain the encoded word.

5.8 CRC codes 

Cyclic redundancy check (CRC) codes are separable cyclic codes with specific
generator polynomials, chosen to provide high error detection capability
for data transmission and storage. Common generator polynomials for CRC are:

CRC-16: $1 + x^2 + x^{15} + x^{16}$

CRC-CCITT: $1 + x^5 + x^{12} + x^{16}$

CRC-32: $1 + x + x^2 + x^4 + x^5 + x^7 + x^8 + x^{10} + x^{11} + x^{12} + x^{16} + x^{22} + x^{23} + x^{26} + x^{32}$


CRC-16 and CRC-CCITT are widely used in modems and network protocols 
in the USA and Europe, respectively, and give adequate protection for most 
applications. An attractive feature of CRC-16 and CRC-CCITT is the small 
number of non-zero terms in their polynomials (just four). This is an advantage 
because the LFSR required to implement encoding and decoding is simpler for 
generator polynomials with a smaller number of terms. Applications that need 
extra protection, such as Department of Defense applications, use CRC-32. 

The encoding and decoding are done either in software or in hardware, using
the procedure from Section 5.7. To perform an encoding, the data polynomial
is first shifted right by deg(g(x)) bit positions, and then divided by the
generator polynomial. The coefficients of the remainder form the check bits of
the CRC codeword. The number of check bits is equal to the degree of the
generator polynomial. So, a CRC detects all burst errors of length less than or equal
to deg(g(x)). A CRC also detects many errors which are longer than deg(g(x)).
For example, apart from detecting all burst errors of length 16 or less, CRC-16
and CRC-CCITT are also capable of detecting 99.997% of burst errors of length
17 and 99.9985% of burst errors of length 18.
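A bit-serial software version of this procedure is sketched below. Several conventions here are assumptions of the sketch rather than part of the text: bytes are processed most significant bit first (the highest-degree coefficient enters first), the register starts at zero, and the generator polynomial is passed without its leading term, so `0x1021` stands for $1 + x^5 + x^{12} + x^{16}$ and `0x8005` for $1 + x^2 + x^{15} + x^{16}$. Practical CRC standards often add initial values or bit reflection, which are not modelled.

```python
def crc_remainder(data: bytes, poly: int, width: int) -> int:
    """Remainder of data(x) * x^width divided by the generator polynomial."""
    register = 0
    mask = (1 << width) - 1
    for byte in data:
        for i in range(7, -1, -1):                 # most significant bit first
            bit = (byte >> i) & 1
            feedback = ((register >> (width - 1)) & 1) ^ bit
            register = (register << 1) & mask
            if feedback:
                register ^= poly                   # subtract g(x) modulo 2
    return register


message = b"123456789"
check = crc_remainder(message, poly=0x1021, width=16)
print(hex(check))

# Appending the check bits gives a codeword, so its remainder is zero.
print(crc_remainder(message + check.to_bytes(2, "big"), 0x1021, 16))
```

The loop body mirrors the LFSR: one shift per input bit, with the feedback taps selected by the non-zero coefficients of the generator polynomial.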

5.9 Reed-Solomon codes 

Reed-Solomon (RS) codes are a class of separable cyclic codes used to correct
errors in a wide range of applications including storage devices (tapes, compact
discs, DVDs, bar-codes), wireless communication (cellular telephones,
microwave links), satellite communication, digital television and high-speed modems
(ADSL, xDSL).

The encoding for Reed-Solomon codes is done similarly to the procedure
described in Section 5.7. The codeword is computed by shifting the data right by
$n - k$ positions, dividing it by the generator polynomial and then adding the
obtained remainder to the shifted data. A key difference is that groups of m bits
rather than individual bits are used as symbols of the code. Usually m = 8, i.e.
a byte. The theory behind this is a field $Z_2^m$ of degree m over {0,1}. The elements
of such a field are m-tuples of 0 and 1, rather than just 0 and 1.

An encoder for a Reed-Solomon code takes k data symbols of m bits each
and computes a codeword containing n symbols of m bits each. The maximum
codeword length is related to m as $n = 2^m - 1$. A Reed-Solomon code can
correct up to $\lfloor (n - k)/2 \rfloor$ symbols that contain errors.

For example, a popular Reed-Solomon code is RS(255,223) where symbols
are a byte (8 bits) long. Each codeword contains 255 bytes, of which 223 bytes
are data and 32 bytes are check symbols. So, n = 255, k = 223 and therefore
this code can correct up to 16 bytes containing errors. Note that each of these
16 bytes can have multiple bit errors.

The decoding of Reed-Solomon codes is performed using an algorithm
designed by Berlekamp. The popularity of Reed-Solomon codes is due to a large
extent to the efficiency of this algorithm. Berlekamp's algorithm was used by
Voyager II for transmitting pictures of outer space back to Earth. It is also a
basis for decoding CDs in players. Many additional improvements were made
over the years to make Reed-Solomon codes practical. Compact discs, for
example, use a modified version of the RS code called cross-interleaved Reed-Solomon
code.
6. Unordered codes 

Unordered codes are designed to detect unidirectional errors. A unidirectional
error is an error which changes either 0's of the word to 1, or 1's of
the word to 0, but not both. An example of a unidirectional error is an error
changing a word [1011000] to the word [0001000]. It is possible to apply a
special design technique to ensure that most of the faults occurring in a logic
circuit cause only unidirectional errors on the output. For example, consider
the logic circuit shown in Figure 5.9. If a single stuck-at fault occurs at any of
the lines in the circuit, it will cause a unidirectional error in the output word
$[f_1 f_2 f_3]$.

The name of unordered codes originates from the following. We say that
two binary n-tuples $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ are ordered if either
$x_i \le y_i$ for all $i \in \{1, 2, \ldots, n\}$, or $x_i \ge y_i$ for all i. For example, if x = [0101] and
y = [0000] then x and y are ordered, namely $x \ge y$. An unordered code is a code
satisfying the property that any two of its codewords are unordered.
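
The definition can be made concrete with a small Python sketch. Words are represented as strings of '0' and '1'; the function names ordered and is_unordered_code are assumptions made for this illustration.

```python
from itertools import combinations


def ordered(x: str, y: str) -> bool:
    """True if x_i <= y_i for all i, or x_i >= y_i for all i."""
    less_or_equal = all(a <= b for a, b in zip(x, y))
    greater_or_equal = all(a >= b for a, b in zip(x, y))
    return less_or_equal or greater_or_equal


def is_unordered_code(code) -> bool:
    """True if every pair of distinct codewords is unordered."""
    return all(not ordered(x, y) for x, y in combinations(code, 2))


print(ordered("0101", "0000"))                      # True: x >= y
print(is_unordered_code(["0011", "0101", "1100"]))  # True
```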

The ability of unordered codes to detect all unidirectional errors is directly
related to the above property. A unidirectional error always changes a word
x to a word y which is either smaller or greater than x. A unidirectional error
cannot change x to a word which is not ordered with x. Therefore, if any two
codewords of a code are unordered, then a unidirectional error will never
map a codeword to another codeword, and thus will be detected.


Figure 5.9. Logic diagram of a circuit in which any single stuck-at fault causes a unidirectional
error on the output.


In this section we describe two unordered codes: m-of-n codes and Berger 
codes. 

6.1 M-of-n codes 

An m-of-n code consists of all n-bit words with exactly m 1's. Any d-bit
unidirectional error forces the affected codeword to have either m + d or m - d
1's, and is thus detected.

An easy way to construct an m-of-n code is to take the original k bits of data
and append k bits so that the resulting 2k-bit codeword has exactly k 1's. For
example, the 3-of-6 code is shown in Table 5.6. All codewords have exactly
three 1's.

An obvious disadvantage of a k-of-2k code is its low information rate of 1/2.
An advantage of this code is its separability, which simplifies the encoding and
decoding procedures. A more efficient m-of-n code, with a higher information
rate, can be constructed, but then the separable nature of the code is usually lost.
Non-separability makes the encoding, decoding and error detection procedures
more difficult.
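
As an illustration, the following Python sketch builds a k-of-2k code. It assumes that the appended k bits are the bitwise complement of the data, which is one way to obtain exactly k 1's and which agrees with the codewords listed in Table 5.6. The function names are hypothetical.

```python
from itertools import product


def encode_k_of_2k(data: str) -> str:
    """Append the bitwise complement so the 2k-bit word has exactly k ones."""
    complement = "".join("1" if bit == "0" else "0" for bit in data)
    return data + complement


def is_m_of_n(word: str, m: int) -> bool:
    """A word belongs to the m-of-n code if it contains exactly m ones."""
    return word.count("1") == m


code = [encode_k_of_2k("".join(bits)) for bits in product("01", repeat=3)]
print(code[0], code[1])              # 000111 001110

# A unidirectional error (two 1's changed to 0) changes the number of ones.
print(is_m_of_n("000100", 3))        # False, so the error is detected
```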

6.2 Berger codes 

Check bits in a Berger code represent the number of 1's in the data word. A
Berger code of length n has k data bits and m check bits, where
$m = \lceil \log_2(k+1) \rceil$ and n = k + m. A codeword is created by appending to the
data word the complemented m-bit binary representation of the number of 1's
in the data word. An example of a Berger code for 4-bit data is shown in Table 5.7.

    data          codeword
d0  d1  d2    c0  c1  c2  c3  c4  c5

 0   0   0     0   0   0   1   1   1
 0   0   1     0   0   1   1   1   0
 0   1   0     0   1   0   1   0   1
 0   1   1     0   1   1   1   0   0
 1   0   0     1   0   0   0   1   1
 1   0   1     1   0   1   0   1   0
 1   1   0     1   1   0   0   0   1
 1   1   1     1   1   1   0   0   0

Table 5.6. Defining table for 3-of-6 code.


      data              codeword
d0  d1  d2  d3    c0  c1  c2  c3  c4  c5  c6

 0   0   0   0     0   0   0   0   1   1   1
 0   0   0   1     0   0   0   1   1   1   0
 0   0   1   0     0   0   1   0   1   1   0
 0   0   1   1     0   0   1   1   1   0   1
 0   1   0   0     0   1   0   0   1   1   0
 0   1   0   1     0   1   0   1   1   0   1
 0   1   1   0     0   1   1   0   1   0   1
 0   1   1   1     0   1   1   1   1   0   0
 1   0   0   0     1   0   0   0   1   1   0
 1   0   0   1     1   0   0   1   1   0   1
 1   0   1   0     1   0   1   0   1   0   1
 1   0   1   1     1   0   1   1   1   0   0
 1   1   0   0     1   1   0   0   1   0   1
 1   1   0   1     1   1   0   1   1   0   0
 1   1   1   0     1   1   1   0   1   0   0
 1   1   1   1     1   1   1   1   0   1   1

Table 5.7. Defining table for Berger code for 4-bit data.


The primary advantages of a Berger code are that it is a separable code and
that it detects all unidirectional multiple errors. It has been shown that the Berger
code is the most compact code for this purpose. The information rate of a Berger
code for k-bit data is $k/(k + \lceil \log_2(k+1) \rceil)$. Table 5.8 shows how the information
rate grows as the size of the encoded data increases. For data of small size,
the redundancy of a Berger code is high. However, as k increases, the number
of check bits relative to the number of data bits drops substantially. The Berger
codes with $k = 2^m - 1$ are called maximal length Berger codes.


number of     number of     information
data bits     check bits    rate

     4             3           0.57
     8             4           0.67
    16             5           0.76
    32             6           0.84
    64             7           0.90
   128             8           0.94

Table 5.8. Information rate of different Berger codes.
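
The construction of a Berger code and the entries of Table 5.8 can be reproduced with the following Python sketch. The function names are our own. The number of check bits is computed as k.bit_length(), which for k >= 1 equals $\lceil \log_2(k+1) \rceil$ and avoids floating point rounding.

```python
def berger_check_bits(k: int) -> int:
    """Number of check bits m = ceil(log2(k + 1)), computed with integers."""
    return k.bit_length()


def berger_encode(data: str) -> str:
    """Append the complemented binary count of ones in the data word."""
    m = berger_check_bits(len(data))
    count = format(data.count("1"), f"0{m}b")
    check = "".join("1" if bit == "0" else "0" for bit in count)
    return data + check


def berger_check(word: str, k: int) -> bool:
    """A word is a codeword if its check bits match its first k data bits."""
    return berger_encode(word[:k]) == word


def information_rate(k: int) -> float:
    return k / (k + berger_check_bits(k))


print(berger_encode("0011"))             # 0011101, as in Table 5.7
print(round(information_rate(4), 2))     # 0.57, as in Table 5.8
print(berger_check("0101100", 4))        # False: 0111100 with one 1 changed to 0
```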


7. Arithmetic codes 

Arithmetic codes are usually used for detecting errors in arithmetic operations,
such as addition or multiplication. The data representing the operands,
say b and c, is encoded before the operation is performed. The operation is
carried out on the resulting codewords A(b) and A(c). After the operation, the
codeword A(b) * A(c) representing the result of the operation * is decoded
and checked for errors.

An arithmetic code relies on the property of invariance with respect to the
operation *:

A(b * c) = A(b) * A(c)

Invariance guarantees that the operation * on codewords A(b) and A(c) gives
us the same result as A(b * c). So, if no error has occurred, decoding A(b * c)
gives us b * c, the result of the operation * on b and c.

Two common types of arithmetic codes are AN-codes and residue codes.

7.1 AN-codes

The AN-code is the simplest representative of arithmetic codes. The codewords
are obtained by multiplying data words N by some constant A. For example,
if the data is of length two, namely [00], [01], [10], [11], then the 3N-code is
[0000], [0011], [0110], [1001]. Each codeword is computed by multiplying a
data word by 3. And vice versa, to decode a codeword, we divide it by 3. If
there is no remainder, no error has occurred. AN-codes are non-separable codes.

The AN-code is invariant with respect to addition and subtraction, but not
to multiplication and division. For example, clearly $3(a \cdot b) \ne 3a \cdot 3b$ for all
non-zero a, b.

The constant A determines the information rate of the code and its error
detection capability. For binary codes, A should not be a power of two. This is
because a single-bit error in position r changes the codeword by $\pm 2^r$. If
$A = 2^a$ and $r \ge a$, the resulting word is still a multiple of A, that is, a
codeword, and the error cannot be detected.
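
The following Python sketch illustrates the 3N-code on integers. The function names an_encode, an_check and an_decode are hypothetical, and the second part uses A = 4 only to show why a power of two is a poor choice.

```python
A = 3  # code constant of the 3N-code used in the text


def an_encode(n: int) -> int:
    return A * n


def an_check(word: int) -> bool:
    """A word is a codeword if it is divisible by A without remainder."""
    return word % A == 0


def an_decode(word: int) -> int:
    if not an_check(word):
        raise ValueError("error detected: word is not a multiple of A")
    return word // A


codewords = [format(an_encode(n), "04b") for n in range(4)]
print(codewords)                     # ['0000', '0011', '0110', '1001']

# Invariance with respect to addition: A(2) + A(1) decodes to 2 + 1.
total = an_encode(2) + an_encode(1)
print(an_decode(total))              # 3

# Every single-bit error in the 4-bit codeword 0110 is detected for A = 3.
single_bit_errors_detected = all(
    not an_check(an_encode(2) ^ (1 << r)) for r in range(4)
)

# With A = 4, flipping bit 2 of the codeword 4 * 2 = 8 gives 12, again a multiple of 4.
undetected_with_power_of_two = ((4 * 2) ^ (1 << 2)) % 4 == 0
```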

7.2 Residue codes 

Residue codes are separable arithmetic codes which are created by computing
a residue for the data and appending it to the data. The residue is generated by
dividing the data by an integer, called the modulus, and taking the remainder.
Decoding is done by simply removing the residue.

Residue codes are invariant with respect to addition, since

(b + c) mod m = ((b mod m) + (c mod m)) mod m

where b and c are data words and m is the modulus. This allows us to handle
residues separately from data during the addition process. The value of the modulus
determines the information rate and the error detection capability of the code.
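
A minimal Python sketch of a residue-checked addition is given below. A codeword is modeled as a pair (data, residue); the modulus M = 3 and all function names are assumptions made for illustration only.

```python
M = 3  # modulus, chosen here only for illustration


def residue_encode(data: int):
    """Codeword = (data, residue); the code is separable."""
    return data, data % M


def residue_check(word) -> bool:
    data, residue = word
    return data % M == residue


def checked_add(x, y):
    """Add the data parts and the residues separately (residues modulo M)."""
    (b, rb), (c, rc) = x, y
    return b + c, (rb + rc) % M


result = checked_add(residue_encode(5), residue_encode(7))
print(result, residue_check(result))     # (12, 0) True

# A single-bit error in the data part of the sum is caught by the check.
corrupted = (result[0] ^ 1, result[1])
print(residue_check(corrupted))          # False
```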

A variation of residue codes is the inverse residue code, where an inverse of
the residue, rather than the residue itself, is appended to the data. These codes
have been shown to have better fault detecting capabilities for common-mode faults.

8. Problems 

5.1. Give an example of a binary code of length 4 and of size 6. How many 
words are contained in the codespace of your code? 

5.2. Why is the separability of a code considered to be a desirable feature? 

5.3. Define the information rate. How is the information rate related to redun¬ 
dancy? 

5.4. What is the main difference in the objectives of encoding for the coding 
theory and the cryptography? 

5.5. What is the maximum Hamming distance between two words in the codespace
$\{0,1\}^n$?

5.6. Consider the code C = {01100101110,10010110111,01010011001}. 

(a) What is the code distance of C? 

(b) How many errors can be detected/corrected by code C? 

5.7. How would you generalize the notions of Hamming distance and code distance
to ternary codes using {0,1,2} as valid symbols? Find a generalization
which preserves the following two properties: To be able to correct s-digit
errors, a ternary code should have a code distance of at least 2s + 1. To
be able to detect s-digit errors, the ternary code distance should be at least
s + 1.

5.8. Prove that, for any n > 1, a parity code of length n has code distance two.

5.9. (a) Construct an even parity code C for 3-bit data.

(b) Suppose the word (1101) is received. Assuming a single-bit error, what
are the codewords that have possibly been transmitted?

5.10. Draw a gate-level logic circuit of an odd parity generation circuit for 5-bit
data. Limit yourself to the use of two-input gates only.

5.11. How would you generalize the notion of parity for ternary codes? Give an
example of a ternary parity code for 3-digit data, satisfying your definition.

5.12. Construct the parity check matrix H and the generator matrix G for a linear
code for 4-bit data which can:

(a) detect 1 error 

(b) correct 1 error 

(c) correct 1 error and detect one additional error. 

5.13. Construct the parity check matrix H and the generator matrix G for a linear 
code for 5-bit data which can: 

(a) detect 1 error 

(b) correct 1 error 

(c) correct 1 error and detect one additional error. 

5.14. (a) Construct the parity check matrix H and the generator matrix G of a
Hamming code for 11-bit data.

(b) Find whether you can construct a Hamming code for data of lengths 1, 2
and 3. Construct the parity check matrix H and the generator matrix G
for those lengths for which it is possible.

5.15. The parity code is a linear code, too. Construct the parity check matrix H 
and the generator matrix G for a parity code for 4-bit data. 

5.16. Find the generator matrix for the (7,4) cyclic code C with the generator
polynomial $1 + x^2 + x^3$. Prove that C is a Hamming code.

5.17. Find the generator matrix for the (15,11) cyclic code C with the generator
polynomial $1 + x + x^4$. Prove that C is a Hamming code.

5.18. Compute the check polynomial for the (7,4) cyclic code with the generator
polynomial $g(x) = 1 + x^2 + x^3$.

5.19. Let C be an (n,k) cyclic code. Prove that the only burst errors of length
n - k + 1 that are codewords (and therefore not detectable errors) are shifts
of scalar multiples of the generator polynomial.

5.20. Suppose you use a cyclic code generated by the polynomial g(x) = l+x+x^. 
You have received a word c{x) = l+x + x^ + x^. Check whether an error 
has occurred during transmission.

5.21. Develop an LFSR for decoding of 4-bit data using the generator polynomial 
g{x) = I+x‘^. Show the state table for the word c{x) = -f -f -f -f 1 
(as the one in Table 5.4). Is c(x) a valid codeword? 

5.22. Construct a separable cyclic code for 4-bit data generated by the polynomial 
g{x) = 1 -|-x-|-x^. What code distance has the resulting code? 

5.23. (a) Draw an LFSR decoding circuit for CRC codes with the following
generator polynomials:

CRC-16: $1 + x^2 + x^{15} + x^{16}$

CRC-CCITT: $1 + x^5 + x^{12} + x^{16}$

You may use "..." between the registers 2 and 15 in the first polynomial
and 5 and 12 in the second, to make the picture shorter.

(b) Use the first generator polynomial for encoding the data \+x^ + x^. 

(c) Suppose that the error 1 -|- x -f x^ is added to the codeword you obtained 
in the previous task. Check whether this error will be detected or not. 

5.24. Construct a Berger code for 3-bit data. What code distance has the resulting 
code? 

5.25. Suppose we know that the original 4-bit data words will never include the 
word 0000. Can we reduce the number of check bits required for a Berger 
code and still cover all unidirectional errors? 

5.26. Suppose we encoded 8-bit data using a Berger code. 

(a) How many check bits are required? 

(b) Take one codeword c and list all possible unidirectional errors which 
can affect c. 

5.27. (a) Construct a 3N arithmetic code for 3-bit data.

(b) Give an example of a fault which is detected by such a code and an
example of a fault which is not detected by such a code.

5.28. Consider the following code C:

0 0 0 0 0 0
0 0 0 1 0 1
0 0 1 0 1 0
0 0 1 1 1 1
0 1 0 1 0 0
0 1 1 0 0 1
0 1 1 1 1 0
1 0 0 0 1 1

(a) What kind of code is this? 

(b) Is it a separable code? 

(c) What is the code distance of C? 

(d) What kind of faults can it detect/correct? 

(e) How are encoding and decoding done for this code? (describe in words) 

(f) How is error detection done for this code? (describe in words)

5.29. Consider the following code C:

0 0 0 0 0 0
0 0 0 0 1 1
0 0 0 1 1 0
0 0 1 0 0 1
0 0 1 1 0 0
0 0 1 1 1 1
0 1 0 0 1 0
0 1 0 1 0 1
0 1 1 0 0 0
0 1 1 0 1 1
0 1 1 1 1 0
1 0 0 0 0 1
1 0 0 1 0 0
1 0 0 1 1 1
1 0 1 0 1 0
1 0 1 1 0 1

(a) What kind of code is this?

(b) Is it a separable code?

(c) What is the code distance of C?

(d) What kind of faults can it detect/correct?

(e) How are encoding and decoding done for this code? (describe in words) 

(f) How is error detection done for this code? (describe in words) 

5.30. Consider the following code: 

0 0 0 1 1 1
0 0 1 1 1 0
0 1 0 1 0 1
0 1 1 1 0 0
1 0 0 0 1 1
1 0 1 0 1 0
1 1 0 0 0 1
1 1 1 0 0 0

(a) What kind of code is this? 

(b) Is it a separable code? 

(c) What is the code distance of C? 

(d) What kind of faults can it detect/correct? 

(e) Design a circuit (gate-level) for encoding of 3-bit data in this code. Your 
circuit should have 3 inputs for data bits and 6 outputs for codeword 
bits. 

(f) How would you suggest to do error detection for this code? (describe 
in words). 


5.31. Develop a scheme for active hardware redundancy (either standby sparing 
or pair-and-a-spare) employing error detection code of your choice for 1-bit 
error detection. 


Chapter 6


TIME REDUNDANCY


1. Introduction 

Space redundancy techniques discussed so far impact physical entities like
cost, weight, size and power consumption. In some applications, extra time is
of less importance than extra hardware.

Time redundancy is achieved by repeating the computation or data transmission
and comparing the result to a stored copy of the previous result. If
the computation is done twice, and if the fault which has occurred is transient,
then the stored copy will differ from the re-computed result, so the fault will be
detected. If the computation is done three or more times, a fault can be corrected.
In this chapter, we show that time redundancy techniques can also be used for
detecting permanent faults.
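
The basic idea can be sketched in Python as follows. The sketch assumes that correction is done by majority voting over the repeated results, which is one common way to realize it. The class TransientlyFaultySquare is a hypothetical unit that returns a wrong result on its first call only, standing in for a transient fault.

```python
from collections import Counter


def run_with_time_redundancy(compute, x, repetitions=3):
    """Repeat a computation; return (majority result or None, fault detected)."""
    results = [compute(x) for _ in range(repetitions)]
    fault_detected = len(set(results)) > 1
    value, votes = Counter(results).most_common(1)[0]
    corrected = value if votes > repetitions // 2 else None
    return corrected, fault_detected


class TransientlyFaultySquare:
    """Hypothetical unit: the first call is hit by a transient fault."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x * x + 1 if self.calls == 1 else x * x


print(run_with_time_redundancy(TransientlyFaultySquare(), 4, repetitions=2))  # (None, True)
print(run_with_time_redundancy(TransientlyFaultySquare(), 4, repetitions=3))  # (16, True)
```

With two repetitions the fault is detected but there is no majority, so no corrected value is returned. With three repetitions the two fault-free results outvote the faulty one.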

Apart from detection and correction of faults, time redundancy is useful for
distinguishing between transient and permanent faults. If the fault disappears
after the re-computation, it is assumed to be transient. In this case the hardware
module is still usable and it would be a waste of resources to take it out of
operation.

2. Alternating logic 

The alternating logic time redundancy scheme was developed by Reynolds
and Metze in 1978. It has been applied to permanent fault detection in digital
data transmission and in digital circuits.

Suppose the data is transmitted over a parallel bus as shown in Figure 6.1.
At time $t_0$, the original data is transmitted. Then, the data is complemented and
re-transmitted at time $t_0 + \Delta$. The two results are compared to check whether
they are complements of each other. Any disagreement indicates a fault. Such
a scheme is capable of detecting permanent stuck-at faults at the bus lines.

Figure 6.1. Alternating logic time redundancy scheme.
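
The bus scheme can be simulated with a short Python sketch. The bus is modeled as a list of bits and a stuck-at fault as a dictionary mapping a line index to its stuck value; both the model and the function names are assumptions made for illustration.

```python
def transmit(word, stuck_at):
    """Send a word over the bus; stuck lines always deliver their stuck value."""
    return [stuck_at.get(i, bit) for i, bit in enumerate(word)]


def alternating_transfer(word, stuck_at=None):
    """Send the word, then its complement; return (received word, fault detected)."""
    stuck_at = stuck_at or {}
    first = transmit(word, stuck_at)
    second = transmit([1 - bit for bit in word], stuck_at)
    # In a fault-free transfer every line carries complementary values.
    fault_detected = any(a == b for a, b in zip(first, second))
    return first, fault_detected


print(alternating_transfer([1, 0, 1, 1]))           # ([1, 0, 1, 1], False)
print(alternating_transfer([1, 0, 1, 1], {2: 0}))   # ([1, 0, 0, 1], True)
```

Note that a stuck line is detected even if its stuck value happens to match the transmitted bit, because the complemented transfer cannot change it.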


The alternating logic concept can be used for detecting faults in logic circuits
which implement self-dual functions. The dual of a function $f(x_1, x_2, \ldots, x_n)$ is
defined as

$f_d(x_1, x_2, \ldots, x_n) = f'(x_1', x_2', \ldots, x_n')$,

where ' denotes the complement. For example, a 2-variable AND $f(x_1, x_2) = x_1 \cdot x_2$
is the dual of a 2-variable OR $f(x_1, x_2) = x_1 + x_2$, and vice versa. A function
is said to be self-dual if it is equal to its dual, $f = f_d$. So, the value of a
self-dual function f for the input assignment $(x_1, x_2, \ldots, x_n)$ equals the
complement of the value of f for the input assignment $(x_1', x_2', \ldots, x_n')$. Examples of
self-dual functions are the sum and carry-out output functions of a full-adder (Figure
6.2). The sum is $s(a, b, c_{in}) = a \oplus b \oplus c_{in}$, where $\oplus$ is an XOR. The carry-out
is $c_{out}(a, b, c_{in}) = ab + (a \oplus b) c_{in}$. Table 6.1 shows the defining table for s and
$c_{out}$. It is easy to see that the property $f(x_1, x_2, \ldots, x_n) = f'(x_1', x_2', \ldots, x_n')$ holds
for both functions.


Figure 6.2. Logic diagram of a full-adder.


For a circuit implementing a self-dual function, the application of an input
assignment $(x_1, x_2, \ldots, x_n)$ followed by the input assignment $(x_1', x_2', \ldots, x_n')$ should
produce output values which are complements of each other, unless the circuit
has a fault. So, a fault can be detected by finding an input assignment for which
$f(x_1, x_2, \ldots, x_n) = f(x_1', x_2', \ldots, x_n')$. For example, a stuck-at-1 fault marked in
Figure 6.2 can be detected by applying the input assignment $(a, b, c_{in}) = (100)$,
followed by the complemented assignment (011). In a fault-free full-adder,
$s(100) = 1$, $c_{out}(100) = 0$ and $s(011) = 0$, $c_{out}(011) = 1$. However, in the presence
of the marked fault, $s(100) = 1$, $c_{out}(100) = 1$ and $s(011) = 0$, $c_{out}(011) = 1$.
Since $c_{out}(100) = c_{out}(011)$, the fault is detected.


a   b   c_in    s   c_out

0   0   0       0   0
0   0   1       1   0
0   1   0       1   0
0   1   1       0   1
1   0   0       1   0
1   0   1       0   1
1   1   0       0   1
1   1   1       1   1

Table 6.1. Defining table for a full-adder.

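
The self-duality of s and $c_{out}$ and the detection condition can be checked exhaustively with a Python sketch. Since the exact location of the fault marked in Figure 6.2 is not recoverable here, the function faulty_c_out models a hypothetical stuck-at-1 fault on the output of the AND gate computing ab; it reproduces the faulty values quoted above, but it is only an assumption.

```python
from itertools import product


def s(a, b, c_in):
    return a ^ b ^ c_in


def c_out(a, b, c_in):
    return (a & b) | ((a ^ b) & c_in)


def complement(bits):
    return tuple(1 - v for v in bits)


def is_self_dual(f, n):
    """f is self-dual if f(x) is the complement of f(x') for every input x."""
    return all(f(*x) == 1 - f(*complement(x)) for x in product((0, 1), repeat=n))


def detecting_inputs(f, n):
    """Inputs for which f(x) == f(x'), which reveals a fault in a self-dual circuit."""
    return [x for x in product((0, 1), repeat=n) if f(*x) == f(*complement(x))]


def faulty_c_out(a, b, c_in):
    ab = 1  # hypothetical stuck-at-1 fault on the output of the AND gate for ab
    return ab | ((a ^ b) & c_in)


print(is_self_dual(s, 3), is_self_dual(c_out, 3))       # True True
print((1, 0, 0) in detecting_inputs(faulty_c_out, 3))   # True
```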

If the function $f(x_1, x_2, \ldots, x_n)$ realized by the circuit is not self-dual, then it
can be transformed to a self-dual function of n + 1 variables, defined by

$f_{sd} = x_{n+1} f + x_{n+1}' f_d$

The new variable $x_{n+1}$ is a control variable determining whether the value of f or
$f_d$ appears on the output. Clearly, such a function $f_{sd}$ produces complemented
values for complemented inputs. A drawback of this technique is that the circuit
implementing $f_{sd}$ can be twice as large as the circuit implementing f.
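
As a sketch of this transformation, the following Python code builds $f_{sd}$ from a 2-variable AND, which is not self-dual, and checks the result exhaustively. The helper names are our own, and the control variable is taken as the last argument.

```python
from itertools import product


def complement(bits):
    return tuple(1 - v for v in bits)


def is_self_dual(f, n):
    return all(f(*x) == 1 - f(*complement(x)) for x in product((0, 1), repeat=n))


def dual(f):
    """f_d(x) = f'(x'), the complement of f applied to complemented inputs."""
    return lambda *x: 1 - f(*complement(x))


def make_self_dual(f):
    """f_sd = x_{n+1} f + x_{n+1}' f_d, with the control variable as last argument."""
    fd = dual(f)

    def fsd(*args):
        *x, control = args
        return f(*x) if control else fd(*x)

    return fsd


def and2(x1, x2):
    return x1 & x2


and2_sd = make_self_dual(and2)
print(is_self_dual(and2, 2), is_self_dual(and2_sd, 3))   # False True
```

With the control variable set to 1 the circuit computes AND, and with the control variable set to 0 it computes the dual function, OR.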

3. Recomputing with shifted operands 

The recomputing with shifted operands (RESO) time redundancy technique was
developed by Patel and Fung in 1982 for on-line fault detection in arithmetic
logic units (ALUs) with bit-sliced organization.

At time $t_0$, the bit slice i of a circuit performs a computation. Then, the data
is shifted left and the computation is repeated at time $t_0 + \delta$. The shift operation
can be either an arithmetic or a logical shift. After the computation, the result is
shifted right. The two results are compared. If there is no error, they are the
same. Otherwise, they disagree in either the ith, or (i - 1)th, or both bits.

The fault detection capability of RESO depends on the amount of shift. For
example, for a bit-sliced ripple-carry adder, a 2-bit arithmetic shift is required
to guarantee the fault detection. A fault in the ith bit slice can have one of
the three effects:

1 The sum bit is erroneous. Then, the incorrect result differs from the correct
one by either $-2^i$ (if the sum is 0), or by $+2^i$ (if the sum is 1).



2 The carry bit is erroneous. Then, the incorrect result differs from the correct
one by either $-2^{i+1}$ (if the carry is 0), or by $+2^{i+1}$ (if the carry is 1).

3 Both the sum and the carry bits are erroneous. Then, we have four possibilities:

■ sum is 0, carry is 0: $-3 \cdot 2^i$;

■ sum is 0, carry is 1: $+2^i$;

■ sum is 1, carry is 0: $-2^i$;

■ sum is 1, carry is 1: $+3 \cdot 2^i$.

Summarizing, if the operands are not shifted, then the erroneous result differs
from the correct one by one of the following values: $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \cdot 2^i\}$.

A similar analysis can be done to show that if the operands are shifted left
by two bits, then the erroneous result differs from the correct one by one of the
following values: $\{0, \pm 2^{i-2}, \pm 2^{i-1}, \pm 3 \cdot 2^{i-2}\}$. So, results of non-shifted and shifted
computations cannot agree unless they are both correct.
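
The behavior of RESO on a ripple-carry adder can be explored with a small simulation. The Python sketch below makes several simplifying assumptions that are ours: only one fault model is covered (the sum output of one bit slice stuck at 0), the shift is a logical shift by two bits, and the adder is widened by shift + 1 slices to hold the shifted operands and the final carry.

```python
def ripple_add(a: int, b: int, slices: int, faulty_slice=None) -> int:
    """Bit-sliced ripple-carry addition; the sum output of faulty_slice is stuck at 0."""
    carry, result = 0, 0
    for i in range(slices):
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | ((x ^ y) & carry)
        if i == faulty_slice:
            s = 0  # assumed fault model: stuck-at-0 on the sum output of slice i
        result |= s << i
    return result


def reso_add(a: int, b: int, width: int, shift: int = 2, faulty_slice=None):
    """Return (result of the first computation, error detected)."""
    slices = width + shift + 1  # room for the shifted operands and the carry
    first = ripple_add(a, b, slices, faulty_slice)
    second = ripple_add(a << shift, b << shift, slices, faulty_slice) >> shift
    return first, first != second


print(reso_add(5, 2, width=4))                   # (7, False)
print(reso_add(5, 2, width=4, faulty_slice=1))   # (5, True)
```

In the second call the faulty slice corrupts bit 1 of the non-shifted result, while the shifted computation is unaffected, so the comparison reveals the error.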

A primary problem with the RESO technique is the additional hardware required
to store the shifted bits.

4. Recomputing with swapped operands 

Recomputing with swapped operands (RESWO) is another time redundancy
technique, introduced by Johnson in 1988. In RESWO, both operands are split
into two halves. During the first computation, the operands are manipulated
as usual. The second computation is performed with the lower and the upper
halves of operands swapped.

The RESWO technique can detect faults in any single bit slice. For example,
consider a bit-sliced ripple-carry adder with n-bit operands. Suppose the lower
half of an operand contains the bits from 0 to r = n/2 and the upper half contains
the bits from r + 1 to n - 1. During the first computation, if the sum or carry
bits from slice i are faulty, then the resulting sum differs from the correct one
by $2^i$ and $2^{i+1}$, respectively. If both the sum and the carry are faulty, then the
result differs from the correct one by $2^i + 2^{i+1}$. So, before the operands' halves
are swapped, a faulty bit slice i would cause the result to differ from the
correct result by one of the values $\{0, 2^i, 2^{i+1}, 2^i + 2^{i+1}\}$. If i < r, the result of
the re-computation with the lower and the upper halves of operands swapped
differs from the correct result by one of the values $\{0, 2^{i+r}, 2^{i+r+1}, 2^{i+r} + 2^{i+r+1}\}$.
This implies that the results of non-swapped and swapped computations cannot
agree unless they are both correct.

5. Recomputing with duplication with comparison 

The recomputing using duplication with comparison (REDWC) technique
combines hardware redundancy with time redundancy. An n-bit operation is
performed by using two n/2-bit devices twice. The operands are split into two
halves. First, the operation is carried out on the lower halves and their
duplicates and the results are compared. This is then repeated for the upper halves
of the operands.

As an example, consider how REDWC is performed on an n-bit full adder.
First, the lower and upper parts of the adder are used to compute the sum of the
lower parts of the operands. A multiplexer is used to handle the carries at the
boundaries of the adder. The results are compared and one of them is stored
to represent the lower half of the final sum. The second computation is carried
out on the upper parts of the operands. Selection of the appropriate half of the
operands is performed using multiplexers.

The REDWC technique allows detection of all single faults in one half of the adder,
as long as both halves do not become faulty in a similar manner or at the
same time.
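
The two-step procedure can be sketched in Python as follows. The two n/2-bit adder halves are modeled by the function half_adder_unit, and a fault is modeled, purely as an assumption, by inverting the least significant sum bit of the faulty unit. The multiplexers are implicit in the selection of the operand halves.

```python
def half_adder_unit(a: int, b: int, carry_in: int, width: int, faulty: bool = False):
    """One n/2-bit adder; returns (sum bits, carry out)."""
    total = a + b + carry_in
    if faulty:
        total ^= 1  # hypothetical fault: the least significant sum bit is inverted
    return total & ((1 << width) - 1), total >> width


def redwc_add(a: int, b: int, n: int, faulty_unit=None):
    """Add two n-bit operands in two steps; return (sum with carry, error detected)."""
    half = n // 2
    mask = (1 << half) - 1
    result, carry, error = 0, 0, False
    for step in range(2):  # step 0: lower halves, step 1: upper halves
        x = (a >> (step * half)) & mask
        y = (b >> (step * half)) & mask
        r0 = half_adder_unit(x, y, carry, half, faulty=(faulty_unit == 0))
        r1 = half_adder_unit(x, y, carry, half, faulty=(faulty_unit == 1))
        error = error or (r0 != r1)   # duplication with comparison
        part, carry = r0              # one of the two results is stored
        result |= part << (step * half)
    return result | (carry << n), error


print(redwc_add(200, 100, n=8))                  # (300, False)
print(redwc_add(200, 100, n=8, faulty_unit=1))   # (300, True)
```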

6. Problems 

6.1. Give three examples of applications where time is less important than
hardware.

6.2. Two independent methods for fault detection on busses are:

■ the use of a parity bit,

■ the use of alternating logic.

Neither of these methods has the capability of correcting an error. However,
together these two methods can be used to correct any single permanent fault
(stuck-at type). Explain how. Use an example to illustrate your algorithm.

6.3. Write a truth table for a 2-bit full adder and check whether the sum s and
the carry out $c_{out}$ are self-dual functions.






Chapter 7 


SOFTWARE REDUNDANCY 


Programs are really not much more than the programmer’s best guess about what a system 
should do. 

—Russel Abbot 


1. Introduction 

In this chapter, we discuss techniques for software fault-tolerance. In general,
fault-tolerance in the software domain is not as well understood and mature as
fault-tolerance in the hardware domain. Controversial opinions exist on whether
reliability can be used to evaluate software. Software does not degrade with
time. Its failures are mostly due to the activation of specification or design faults
by the input sequences. So, if a fault exists in software, it will manifest itself the
first time the relevant conditions occur. This makes the reliability of a software
module dependent on the environment that generates the input to the module
over time. Different environments might result in different reliability values.
The Ariane 5 rocket accident is an example of how a piece of software, safe for
the Ariane 4 operating environment, can cause a disaster in a new environment.
As we described in Section 3.2, the Ariane 5 rocket exploded 37 seconds after its
lift-off, due to complete loss of guidance and attitude information. The loss of
information was caused by a fault in the software of the inertial reference system,
which resulted from violating the maximum floating point number assumption.

Many current techniques for software fault tolerance attempt to leverage the
experience of hardware redundancy schemes. For example, software N-version
programming closely resembles hardware N-modular redundancy. Recovery
blocks use the concept of retrying the same operation in the expectation that the
problem is resolved after the second try. However, traditional hardware fault
tolerance techniques were developed to fight permanent component faults
primarily, and transient faults caused by environmental factors secondarily. They
do not offer sufficient protection against design and specification faults, which
are dominant in software. By simply triplicating a software module and voting
on its outputs we cannot tolerate a fault in the module, because all copies have
identical faults. The design diversity technique, described in Section 3.3, has to
be applied. It requires the creation of diverse and equivalent specifications so that
programmers can design software which does not share common faults. This is
widely accepted to be a difficult task.

A software system usually has a very large number of states. For example,
a collision avoidance system required on most commercial aircraft in the U.S.
has $10^{40}$ states. A large number of states would not be a problem if the states
exhibited adequate regularity to allow grouping them into equivalence classes.
Unfortunately, software does not exhibit the regularity commonly found in
digital hardware. The large number of states implies that only a very small part
of the software system can be verified for correctness. Traditional testing and
debugging methods are not feasible for large systems. The recent focus on using
formal methods to describe the required characteristics of the software behavior
promises higher coverage. However, due to their extremely large computational
complexity, formal methods are only applicable in specific applications. Due
to incomplete verification, some design faults are not diagnosed and are not
removed from the software.

Software fault-tolerance techniques can be divided into two groups:
single-version and multi-version. Single-version techniques aim to improve the
fault-tolerant capabilities of a single software module by adding fault detection,
containment and recovery mechanisms to its design. Multi-version techniques
employ redundant software modules, developed following design diversity rules.
As in the hardware case, a number of possibilities have to be examined to
determine at which level the redundancy needs to be provided and which modules
are to be made redundant. The redundancy can be applied to a procedure, or to a
process, or to the whole software system. Usually, the components which have
a high probability of faults are chosen to be made redundant. As in the hardware
case, the increase in complexity caused by the redundancy can be quite severe
and may diminish the dependability improvement, unless redundant resources
are allocated in a proper way.

2. Single-version techniques 

Single-version techniques add to a single software module a number of
functional capabilities that are unnecessary in a fault-free environment. The software
structure and its actions are modified to be able to detect a fault, isolate it and
prevent the propagation of its effect throughout the system. In this section, we
consider how fault detection, fault containment and fault recovery are achieved
in the software domain.
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2.1 Fault detection techniques 

As in the hardware ease, the goal of fault deteetion in software is to determine 
that a fault has oeeurred within a system. Single-version fault toleranee teeh- 
niques usually use various types of acceptance tests to deteet faults. The result 
of a program is subjeeted to a test. If the result passes the test, the program 
eontinues its exeeution. A failed test indieates a fault. A test is most effeetive 
if it ean be ealeulated in a simple way and if it is based on eriteria that ean 
be derived independently of the program applieation. The existing teehniques 
inelude timing eheeks, eoding eheeks, reversal eheeks, reasonableness eheeks 
and struetural eheeks. 

Timing checks are applieable to systems whose speeifieation inelude timing 
eonstrains. Based on these eonstrains, eheeks ean be developed to indieate 
a deviation from the required behavior. A watchdog timer is an example of 
a timing eheek. Watehdog timers are used to monitor the performanee of a 
system and deteet lost or looked out modules. 

Coding checks are applieable to systems whose data ean be eneoded using 
information redundaney teehniques. Cyelie redundaney eheeks ean be used in 
eases when the information is merely transported from one module to another 
without ehanging it eontent. Arithmetie eodes ean be used to deteet errors in 
arithmetie operations. 

In some systems, it is possible to reverse the output values and to eompute the 
eorresponding input values. For sueh system, reversal checks ean be applied. 
A reversal eheek eompares the aetual inputs of the system with the eomputed 
ones. A disagreement indieates a fault. 

Reasonableness checks use semantie properties of data to deteet fault. For 
example, a range of data ean be examined for overflow or underflow to indieate 
a deviation from system’s requirements. 

Structural checks are based on known properties of data structures. For
example, the number of elements in a list can be counted, or links and pointers
can be verified. Structural checks can be made more efficient by adding
redundant data to a data structure, e.g. attaching counts of the number of
items in a list, or adding extra pointers.
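The list example can be sketched as follows. The types node_t and list_t and
the function list_structure_ok are assumptions introduced for illustration.
The stored count plays the role of the redundant data: the check walks the
links and compares the number of nodes found with the stored count.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

typedef struct {
    node_t *head;
    size_t count;   /* redundant data: number of items expected in the list */
} list_t;

/* Structural check: walk the links and compare with the stored count.
   The walk is bounded by count, so a corrupted cyclic list still terminates. */
bool list_structure_ok(const list_t *list) {
    size_t seen = 0;
    const node_t *n = list->head;
    while (n != NULL && seen <= list->count) {
        seen++;
        n = n->next;
    }
    return n == NULL && seen == list->count;
}
```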

2.2 Fault containment techniques 

Fault containment in software can be achieved by modifying the structure
of the system and by imposing a set of restrictions defining which actions are
permissible within the system. In this section, we describe four techniques
for fault containment: modularization, partitioning, system closure and atomic
actions.

It is common to decompose a software system into modules with few or
no common dependencies between them. Modularization attempts to prevent
the propagation of faults by limiting the amount of communication between
modules to carefully monitored messages and by eliminating shared resources.
Before performing modularization, visibility and connectivity parameters are
examined to determine which module possesses the highest potential to cause
system failure. The visibility of a module is characterized by the set of modules
that may be invoked directly or indirectly by the module. The connectivity of a
module is described by the set of modules that may be invoked directly or used
by the module.

The isolation between functionally independent modules can be done by
partitioning the modular hierarchy of a software architecture in horizontal or
vertical dimensions. Horizontal partitioning separates the major software
functions into independent branches. The execution of the functions and the
communication between them is done using control modules. Vertical partitioning
distributes the control and processing functions in a top-down hierarchy.
High-level modules normally focus on control functions, while low-level modules
perform processing.

Another technique used for fault containment in software is system closure.
This technique is based on the principle that no action is permissible unless
explicitly authorized. In an environment with many restrictions and strict
control (e.g. in a prison) all the interactions between the elements of the
system are visible. Therefore, it is easier to locate and remove any fault.

An alternative technique for fault containment uses atomic actions to define
interactions between system components. An atomic action among a group of
components is an activity in which the components interact exclusively with
each other. There is no interaction with the rest of the system for the duration
of the activity. Within an atomic action, the participating components neither
import information from, nor export it to, non-participating components
of the system. There are two possible outcomes of an atomic action: either it
terminates normally, or it is aborted upon a fault detection. If an atomic action
terminates normally, its results are correct. If a fault is detected, then this fault
affects only the participating components. Thus, the fault containment area is
defined and fault recovery is limited to the atomic action components.

2.3 Fault recovery techniques 

Once a fault is detected and contained, a system attempts to recover from
the faulty state and regain operational status. If fault detection and containment
mechanisms are implemented properly, the effects of the faults are contained
within a particular set of modules at the moment of fault detection. The
knowledge of the fault containment region is essential for the design of an
effective fault recovery mechanism.

2.3.1 Exception handling

In many software systems, the request for initiation of fault recovery is issued
by exception handling. Exception handling is the interruption of the normal
operation to handle abnormal responses. Possible events triggering the exceptions
in a software module can be classified into three groups:

1 Interface exceptions are signaled by a module when it detects an invalid
service request. This type of exception is supposed to be handled by the
module that requested the service.

2 Local exceptions are signaled by a module when its fault detection
mechanism detects a fault within its internal operations. This type of exception
is supposed to be handled by the faulty module.

3 Failure exceptions are signaled by a module when it has detected that its
fault recovery mechanism is unable to recover successfully. This type of
exception is supposed to be handled by the system.

2.3.2 Checkpoint and restart 

A popular recovery mechanism for single-version software fault tolerance
is checkpoint and restart, also referred to as backward error recovery. As
mentioned previously, most software faults are design faults, activated
by some unexpected input sequence. Faults of this type resemble hardware
intermittent faults: they appear for a short period of time, then disappear, and
then may appear again. As in the hardware case, simply restarting the module
is usually enough to successfully complete its execution.

The general scheme of the checkpoint and restart recovery mechanism is shown
in Figure 7.1. The module executing a program operates in combination with
an acceptance test block AT which checks the correctness of the result. If a
fault is detected, a “retry” signal is sent to the module to re-initialize its state
to the checkpoint state stored in the memory.
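The scheme can be sketched in C as follows. This is a minimal illustration
assuming a single snapshot taken before execution; the type state_t, the
function run_with_restart and the example module doubler with its intermittent
fault are all hypothetical and introduced only for this sketch.

```c
#include <stdbool.h>

/* Hypothetical module state; the single field is an assumption. */
typedef struct {
    int input;
} state_t;

typedef int  (*module_fn)(state_t *s);            /* may corrupt *s on a fault */
typedef bool (*acceptance_fn)(int input, int result);

/* Snapshot the state once, run the module, and on a failed acceptance test
   restore the snapshot and retry. Returns true and stores the result if
   some attempt passes the test. */
bool run_with_restart(state_t *s, module_fn module, acceptance_fn at,
                      int max_retries, int *result) {
    const state_t checkpoint = *s;          /* checkpoint stored in memory */
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        int r = module(s);
        if (at(checkpoint.input, r)) {
            *result = r;
            return true;
        }
        *s = checkpoint;                    /* "retry": roll the state back */
    }
    return false;                           /* recovery failed */
}

/* Example module with an intermittent fault: only the first call fails. */
static int calls = 0;

int doubler(state_t *s) {
    calls++;
    if (calls == 1) {          /* fault activated once */
        s->input = -1;
        return -999;
    }
    return 2 * s->input;
}

/* Acceptance test: the result must be even and halve back to the input. */
bool doubler_at(int input, int result) {
    return result % 2 == 0 && result / 2 == input;
}
```

Calling run_with_restart with doubler and doubler_at fails on the first
attempt, restores the state, and succeeds on the retry.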


Figure 7.1. Checkpoint and restart recovery.

There are two types of checkpoints: static and dynamic. A static checkpoint
takes a single snapshot of the system state at the beginning of the program
execution and stores it in the memory. Fault detection checks are placed at
the output of the module. If a fault is detected, the system returns to this
state and starts the execution from the beginning. Dynamic checkpoints are
created dynamically at various points during the execution. If a fault is detected,
the system returns to the last checkpoint and continues the execution. Fault
detection checks need to be embedded in the code and executed before the
checkpoints are created.

A number of factors influence the efficiency of checkpointing, including
execution requirements, the interval between checkpoints, the fault activation
rate and the overhead associated with creating fault detection checks,
checkpoints, recovery, etc. In a static approach, the expected time to complete
the execution grows exponentially with the execution requirements. Therefore,
static checkpointing is effective only if the processing requirement is relatively
small. In a dynamic approach, it is possible to achieve a linear increase in
execution time as the processing requirements grow. There are three strategies
for dynamic placing of checkpoints:

1 Equidistant, which places checkpoints at deterministic fixed time intervals. 
The time between checkpoints is chosen depending on the expected fault 
rate. 

2 Modular, which places checkpoints at the end of the sub-modules in a
module, after the fault detection checks for the sub-module are completed. The
execution time depends on the distribution of the sub-modules and expected
fault rate.

3 Random, placing checkpoints at random. 

Overall, the restart recovery mechanism has the following advantages:

■ It is conceptually simple. 

■ It is independent of the damage caused by a fault. 

■ It is applicable to unanticipated faults. 

■ It is general enough to be used at multiple levels in a system. 

A problem with restart recovery is that non-recoverable actions exist in
some systems. These actions are usually associated with external events that
cannot be compensated for by simply reloading the state and restarting the
system. Examples of non-recoverable actions are firing a missile or soldering a
pair of wires. The recovery from such actions needs to include special treatment,
for example by compensating for their consequences (e.g. undoing a solder), or
delaying their output until after additional confirmation checks are completed
(e.g. do a friend-or-foe confirmation before firing).

2.3.3 Process pairs 

The process pair technique runs two identical versions of the software on
separate processors (Figure 7.2). First the primary processor, Processor 1, is
active. It executes the program and sends the checkpoint information to the
secondary processor, Processor 2. If a fault is detected, the primary processor is
switched off. The secondary processor loads the last checkpoint as its starting
state and continues the execution. Processor 1 executes diagnostic checks
off-line. If the fault is non-recoverable, the replacement is performed. After
returning to service, the repaired processor becomes the secondary processor.

The main advantage of the process pair technique is that the delivery of service
continues uninterrupted after the occurrence of the fault. It is therefore suitable
for applications requiring high availability.


Figure 7.2. Process pairs.


2.3.4 Data diversity 

Data diversity is a technique aiming to improve the efficiency of checkpoint
and restart by using different input re-expressions for each retry. It is based
on the observation that software faults are usually input sequence dependent.
Therefore, if inputs are re-expressed in a diverse way, it is unlikely that different
re-expressions activate the same fault.

There are three basic techniques for data diversity:

1 Input data re-expression, where only the input is changed.

2 Input data re-expression with post-execution adjustment, where the output
result also needs to be adjusted in accordance with a given set of rules. For
example, if the inputs were re-expressed by encoding them in some code,
then the output result is decoded following the decoding rules of the code.

3 Input data re-expression via decomposition and re-combination, where the
input is decomposed into smaller parts and then re-combined after execution
to obtain the output result.
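The second technique can be illustrated with a small sketch. Everything in it
is hypothetical: faulty_sum stands for a module with an input-dependent design
fault (it stops summing at the first zero element), and the re-expression
shifts every input by a constant offset, so the output must be adjusted by
subtracting n times the offset.

```c
#include <stddef.h>

/* Hypothetical faulty module: sums an array, but because of a design fault
   it stops at the first zero element (an input-dependent fault). */
long faulty_sum(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n && a[i] != 0; i++) {
        s += a[i];
    }
    return s;
}

/* Input re-expression with post-execution adjustment: shift every element
   by offset, run the module on the shifted copy held in scratch (which must
   have room for n elements), then subtract n * offset from the result. */
long sum_with_reexpression(const int *a, int *scratch, size_t n, int offset) {
    for (size_t i = 0; i < n; i++) {
        scratch[i] = a[i] + offset;
    }
    long shifted = faulty_sum(scratch, n);
    return shifted - (long)n * offset;
}
```

For the input {3, 0, 4} the original expression activates the fault, while the
re-expression with offset 1 does not. A different offset would be needed if
some element were equal to -1, which is why each retry normally uses a new
re-expression.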


Data diversity can also be used in combination with the multi-version
fault-tolerance techniques presented in the next section.

3. Multi-version techniques 

Multi-version techniques use two or more versions of the same software
module, which satisfy the design diversity requirements. For example, different
teams, different coding languages or different algorithms can be used to
maximize the probability that the versions do not share common faults.

3.1 Recovery blocks 

The recovery blocks technique combines the checkpoint and restart approach
with the standby sparing redundancy scheme. The basic configuration is shown
in Figure 7.3. Versions 1 to n represent different implementations of the same
program. Only one of the versions provides the system’s output. If an error is
detected by the acceptance test, a retry signal is sent to the switch. The system
is rolled back to the state stored in the checkpoint memory and the switch
then switches the execution to another version of the module. Checkpoints
are created before a version executes. Various checks are used for acceptance
testing of the active version of the module. The check should be kept simple
in order to maintain execution speed. Checks can either be placed at the output
of a module, or embedded in the code to increase the effectiveness of fault
detection.

Similarly to the cold and hot versions of the hardware standby sparing
technique, different versions can be executed either serially or concurrently,
depending on available processing capability and performance requirements.
Serial execution may require the use of checkpoints to reload the state before
the next version is executed. The cost in time of trying multiple versions serially
may be too high, especially for a real-time system. However, a concurrent
system requires the expense of n redundant hardware modules, a communications
network to connect them and the use of input and state consistency algorithms.

If all n versions are tried and failed, the module invokes the exception handler 
to communicate to the rest of the system a failure to complete its function. 

Figure 7.3. Recovery blocks.

As with all multi-version techniques, the recovery blocks technique is heavily
dependent on design diversity. The recovery blocks method increases the pressure
on the specification to be detailed enough to create different multiple
alternatives that are functionally the same. This issue is further discussed in
Section 3.4. In addition, acceptance tests suffer from a lack of guidelines for
their development. They are highly application dependent, they are difficult to
create and they cannot test for a specific correct answer, but only for
“acceptable” values.
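A serial recovery block can be sketched as below. The sketch is illustrative
only: the state is reduced to a single integer, and the two versions and the
acceptance test (abs_primary, abs_alternate, abs_at) are hypothetical. Note
that the acceptance test only checks that the result is non-negative, which is
an "acceptable" value rather than the exact answer.

```c
#include <stdbool.h>
#include <stddef.h>

typedef int  (*version_fn)(int *state);   /* one implementation of the module */
typedef bool (*acceptance_fn)(int result);

/* Serial recovery block: try the versions in order. Before each try the
   state is rolled back to the checkpoint. Returns the index of the version
   that passed the acceptance test, or -1 if all n versions failed. */
int recovery_block(int *state, const version_fn *versions, size_t n,
                   acceptance_fn at, int *result) {
    const int checkpoint = *state;        /* created before a version executes */
    for (size_t i = 0; i < n; i++) {
        *state = checkpoint;              /* roll back */
        int r = versions[i](state);
        if (at(r)) {
            *result = r;
            return (int)i;
        }
    }
    *state = checkpoint;
    return -1;   /* the caller should signal a failure exception */
}

/* Two hypothetical versions computing |x|; the primary has a design fault. */
int abs_primary(int *x)   { return *x; }                  /* wrong for x < 0 */
int abs_alternate(int *x) { return *x < 0 ? -*x : *x; }

/* Acceptance test: checks plausibility only, not the exact answer. */
bool abs_at(int r)        { return r >= 0; }
```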

3.2 N-version programming

The N-version programming technique resembles N-modular hardware
redundancy. The block diagram is shown in Figure 7.4. It consists of n different
software implementations of a module, executed concurrently. Each version
accomplishes the same task, but in a different way. The selection algorithm
decides which of the answers is correct and returns this answer as a result
of the module’s execution. The selection algorithm is usually implemented
as a generic voter. This is an advantage over the recovery block fault detection
mechanism, which requires application dependent acceptance tests.


Figure 7.4. N-version programming.

Many different types of voters have been developed, including the formalized
majority voter, generalized median voter, formalized plurality voter and weighted
averaging technique. The voters have the capability to perform inexact voting
by using the concept of metric space (X, d). The set X is the output space of the
software and d is a metric function that associates any two elements in X with
a real-valued number (see Section 2.5 for the definition of metric). The inexact
values are declared equal if their metric distance is less than some pre-defined
threshold ε. In the formalized majority voter, the outputs are compared and, if
more than half of the values agree, the voter output is selected as one of the
values in the agreement group. The generalized median voter selects the median
of the values as the correct result. The median is computed by successively
eliminating the pair of values that are farthest apart until only one value remains.
The formalized plurality voter partitions the set of outputs based on metric
equality and selects the output from the largest partition group. The weighted
averaging technique combines the outputs in a weighted average to produce
the result. The weights can be selected in advance based on the characteristics
of the individual versions. If all the weights are equal, this technique reduces
to the mean selection technique. The weights can also be selected dynamically
based on pair-wise distances of the version outputs or the success history of the
versions measured by some performance metric.
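Two of these voters can be sketched for real-valued outputs with the metric
d(a, b) = |a - b|. This is a simplified illustration, not a reference
implementation: the function names are hypothetical, the majority voter
measures agreement relative to a candidate value instead of building
agreement groups explicitly, and the median voter assumes an odd number of
outputs and works destructively on the caller's buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Metric on the output space: absolute difference. */
static double dist(double a, double b) {
    return a > b ? a - b : b - a;
}

/* Formalized majority voter (simplified): returns true and stores one value
   of the agreement group if more than half of the outputs lie within eps
   of that value. */
bool majority_vote(const double *out, size_t n, double eps, double *result) {
    for (size_t i = 0; i < n; i++) {
        size_t agree = 0;
        for (size_t j = 0; j < n; j++) {
            if (dist(out[i], out[j]) < eps) {
                agree++;
            }
        }
        if (2 * agree > n) {
            *result = out[i];
            return true;
        }
    }
    return false;   /* no majority: report an error condition */
}

/* Generalized median voter: repeatedly discard the two values that are
   farthest apart. For odd n exactly one value remains. Overwrites buf. */
double median_vote(double *buf, size_t n) {
    while (n > 2) {
        size_t a = 0, b = 1;               /* invariant: a < b */
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                if (dist(buf[i], buf[j]) > dist(buf[a], buf[b])) {
                    a = i;
                    b = j;
                }
            }
        }
        buf[b] = buf[n - 1]; n--;          /* remove b first, since b > a */
        buf[a] = buf[n - 1]; n--;
    }
    return buf[0];
}
```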

The selection algorithms are normally developed taking into account the
consequences of erroneous output for dependability attributes like reliability,
availability and safety. For applications where reliability is important, the
selection algorithm should be designed so that the selected result is correct with
a very high probability. If availability is an issue, the selection algorithm is
expected to produce an output even if it is incorrect. Such an approach would
be acceptable as long as the program execution is not subsequently dependent
on previously generated (possibly erroneous) results. For applications where
safety is the main concern, the selection algorithm is required to correctly
distinguish the erroneous version and mask its results. In cases when the algorithm
cannot select the correct result with a high confidence, it should report to the
system an error condition or initiate an acceptable safe output sequence.

The N-version programming technique can tolerate the design faults present in
the software if the design diversity concept is implemented properly. Each
version of the module should be implemented in as diverse a manner as
possible, including different tool sets, different programming languages, and
possibly different environments. The various development groups must have
as little interaction related to the programming between them as possible. The
specification of the system is required to be detailed enough so that the various
versions are completely compatible. On the other hand, the specification should
be flexible enough to give the programmer a possibility to create diverse designs.

3.3 N self-checking programming

N self-checking programming combines the recovery blocks concept with
N-version programming. The checking is performed either by using acceptance
tests, or by using comparison. Examples of applications of N self-checking
programming are the Lucent 5ESS phone switch and the Airbus A340 airplane.

N self-checking programming using acceptance tests is shown in Figure
7.5. Different versions of the program module and the acceptance tests AT are
developed independently from common requirements. The individual checks
for each of the versions are either embedded in the code, or placed at the output.
The use of separate acceptance tests for each version is the main difference
between this technique and the recovery blocks approach. The execution of each
version can be done either serially or concurrently. In both cases, the output is
taken from the highest-ranking version which passes its acceptance test.


Figure 7.5. N self-checking programming using acceptance tests.


N self-checking programming using comparison is shown in Figure 7.6.
The scheme resembles triplex-duplex hardware redundancy. An advantage
over N self-checking programming using acceptance tests is that an application
independent decision algorithm (comparison) is used for fault detection.

3.4 Design diversity 

The most critical issue in multi-version software fault tolerance techniques 
is assuring independence between the different versions of software through 
design diversity. Design diversity aims to protect the software from containing 
common design faults. Software systems are vulnerable to common design 
faults if they are developed by the same design team, by applying the same 
design rules and using the same software tools.

Figure 7.6. N self-checking programming using comparison.


Presently, the implementation of design diversity remains a controversial
subject. The increase in complexity caused by redundant multiple versions can
be quite severe and may result in a less dependable system, unless appropriate
measures are taken. Decisions to be made when developing a multi-version
software system include:

■ which modules are to be made redundant (usually less reliable modules are 
chosen); 

■ the level of redundancy (procedure, process, whole system); 

■ the required number of redundant versions; 

■ the required diversity (diverse specification, algorithm, code, programming 
language, testing technique, etc.); 

■ rules of isolation between the development teams, to prevent the flow of
information that could result in common design errors.

The cost of development of multi-version software also needs to be taken
into account. A direct replication of the full development effort would have
a total cost prohibitive for most applications. The cost can be reduced by
allocating redundancy to dependability critical parts of the system only. In
situations where demonstrating dependability to an official regulatory authority
tends to be more costly than the actual development effort, design diversity can
be used to make a more dependable system with a smaller safety assessment
effort. When the cost of alternative dependability improvement techniques
is high because of the need for specialized staff and tools, the use of design
diversity can result in cost savings.

4. Software Testing

Software testing is the process of executing a program with the intent of
finding errors [Beizer, 1990]. Testing is a major consideration in software
development. In many organizations, more time is devoted to testing than to
any other phase of software development. On complex projects, test developers
may outnumber code developers on a project team by a factor of two or three.

There are two types of software testing: functional and structural. Functional
testing (also called behavioral testing, black-box testing, closed-box testing)
compares the behavior of the program under test against its specification.
Structural testing (also called white-box testing, glass-box testing) checks the
internal structure of a program for errors. For example, suppose we test a
program which adds two integers. The goal of functional testing is to verify
whether the implemented operation is indeed addition instead of e.g.
multiplication. Structural testing does not question the functionality of the
program, but checks whether the internal structure is consistent. A strength of
the structural approach is that the entire software implementation is taken into
account during testing, which facilitates error detection even when the software
specification is vague or incomplete.

The effectiveness of structural testing is normally expressed in terms of
test coverage metrics, which measure the fraction of code exercised by test
cases. Common test coverage metrics are statement, branch, and path
coverage [Beizer, 1990]. Statement coverage requires that the program under test
is run with enough test cases, so that all its statements are executed at least once.
Branch coverage (also called decision coverage) requires that all branches of the
program are executed at least once. Path coverage requires that each of the
possible paths through the program is followed. Path coverage is the most
reliable metric; however, it is not applicable to large systems, since the number
of paths is exponential in the number of branches.

This section describes a technique for structural testing which finds a part
of the program’s flowgraph, called kernel, with the property that any set of tests
which executes all vertices (edges) of the kernel executes all vertices (edges)
of the flowgraph [Dubrova, 2005].

Related work includes Agrawal’s algorithm [Agrawal, 1994] for computing
the super block dominator graph, which represents all the kernels of the
flowgraph; Bertolino and Marre’s algorithm [Bertolino and Marre, 1994] for
finding path covers in a flowgraph, in which unconstrained arcs are analogous to
the leaves of the dominator tree; Ball’s [Ball, 1993] and Podgurski’s [Podgurski,
1991] techniques for computing control dependence regions in a flowgraph,
which are similar to the super blocks of [Agrawal, 1994]; and Agrawal’s
algorithm [Agrawal, 1999], which addresses the coverage problem at an
inter-procedural level.

4.1 Statement and Branch Coverage

This section gives a brief overview of statement and branch coverage
techniques.

4.1.1 Statement Coverage 

Statement coverage (also called line coverage, segment coverage [Ntafos,
1988], C1 [Beizer, 1990]) examines whether each executable statement of a
program is executed during a test. An extension of statement coverage is basic
block coverage, in which each sequence of non-branching statements is treated
as one statement unit.

The main advantage of statement coverage is that it can be applied directly 
to object code and does not require processing source code. The disadvantages 
are: 


■ Statement coverage is insensitive to some control structures, logical AND 
and OR operators, and switch labels. 

■ Statement coverage only checks whether the loop body was executed or
not. It does not report whether loops reach their termination condition. In
C and Java programs, this limitation affects loops that contain break
statements.

As an example of the insensitivity of statement coverage to some control 
structures, consider the following code: 

    x = 0;
    if (condition)
        x = x + 1;
    y = 10/x;

If there is no test case which causes condition to evaluate false, the error
in this code (a division by zero) will not be detected in spite of 100% statement
coverage. The error will appear only if condition evaluates false for some test
case. Since if-statements are common in programs, this problem is a serious
drawback of statement coverage.
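To make the point concrete, the fragment can be wrapped in a function and
exercised by a single test. The wrapper name fragment and its signature are
assumptions made for this sketch. The one test call fragment(1) executes every
statement, so statement coverage is 100%, yet the call fragment(0), which is
never made, would divide by zero.

```c
/* The fragment above wrapped in a function (the wrapper is hypothetical). */
int fragment(int condition) {
    int x = 0;
    if (condition)
        x = x + 1;
    return 10 / x;   /* division by zero when condition is false */
}
```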

4.1.2 Branch Coverage 

Branch coverage (also referred to as decision coverage, all-edges
coverage [Roper, 1994], C2 [Beizer, 1990]) requires that each branch of a program
is executed at least once during a test. Boolean expressions of if- or
while-statements are checked to be evaluated to both true and false. The entire
Boolean expression is treated as one predicate regardless of whether it contains
logical AND and OR operators. Switch statements, exception handlers, and
interrupt handlers are treated similarly. Decision coverage includes statement
coverage since executing every branch leads to executing every statement.

An advantage of branch coverage is its relative simplicity. It overcomes
many problems of statement coverage. However, it might miss some
errors as demonstrated by the following example:

    if (condition1)
        x = 0;
    else
        x = 2;
    if (condition2)
        y = 10*x;
    else
        y = 10/x;

A 100% branch coverage can be achieved by two test cases which cause
both condition1 and condition2 to evaluate true, and both condition1
and condition2 to evaluate false. However, the error which occurs when
condition1 evaluates true and condition2 evaluates false will not be detected
by these two tests.

The error in the example above can be detected by exercising every path
through the program. However, since the number of paths is exponential in
the number of branches, testing every path is not possible for large systems.
For example, if one test case takes $0.1 \times 10^{-5}$ seconds to execute, then
testing all paths of a program containing 30 if-statements will take 18 minutes
and testing all paths of a program with 60 if-statements will take 366 centuries.
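The quoted figures can be reproduced with a short calculation. The sketch
below assumes that k if-statements in sequence give 2^k paths and that one
test takes one microsecond; both are assumptions chosen because they match
the 18 minutes and 366 centuries stated above. The function name is
hypothetical.

```c
/* Seconds needed to run one test per path of a program with k sequential
   if-statements (2^k paths), at secs_per_test seconds per test. */
double exhaustive_path_seconds(unsigned k, double secs_per_test) {
    double paths = 1.0;
    for (unsigned i = 0; i < k; i++) {
        paths *= 2.0;
    }
    return paths * secs_per_test;
}
```

With secs_per_test = 1e-6, k = 30 gives about 1074 seconds (roughly 18
minutes), and k = 60 gives about 1.15e12 seconds, which is roughly 366
centuries when a year is taken as 365 days.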

Branch coverage differs from basis path coverage, which requires each basis
path in the program flowgraph to be executed during a test [Watson, 1996]. Basis
paths are a minimal subset of paths that can generate all possible paths by linear
combination. The number of basis paths is called the cyclomatic number of the
flowgraph.

4.2 Preliminaries 

A flowgraph is a directed graph $G = (V, E, entry, exit)$, where $V$ is the set of
vertices representing basic blocks of the program, $E \subseteq V \times V$ is the
set of edges connecting the vertices, and entry and exit are two distinguished
vertices of $V$. Every vertex in $V$ is reachable from the entry vertex, and exit is
reachable from every vertex in $V$.

Figure 7.8 shows the flowgraph of the C program in Figure 7.7, where
b1, b2, ..., b16 are blocks whose contents are not relevant for our purposes.

    algorithm Example
        b1;
        while (b2) {
            for (b3) {
                b4;
                for (b5) {
                    if (b6) b7;
                    else b8;
                }
                if (b9) break;
            }
            if (b10) {
                while (b11) b12;
            }
            else {
                if (b13) b14;
                else continue;
            }
            b15;
        }
        b16;
    end

Figure 7.7. Example C program.


A vertex v pre-dominates another vertex u, if every path from entry to u
contains v. A vertex v post-dominates another vertex u, if every path from u to
exit contains v.

By Pre(v) and Post(v) we denote the sets of all nodes which pre-dominate and
post-dominate v, respectively. E.g. in Figure 7.8, Pre(5) = {1,2,3,4} and
Post(5) = {9,10,16}.

Many properties are common to pre- and post-dominators. In the rest of this
section, we use the word dominator to refer to cases which apply to both
relationships.

Vertex v is the immediate dominator of u, if v dominates u and every other
dominator of u dominates v. Every vertex $v \in V$ except entry (exit) has a
unique immediate pre-dominator (post-dominator), idom(v) [Lowry and
Medlock, 1969]. For example, in Figure 7.8, vertex 4 is the immediate
pre-dominator of 5, and vertex 9 is the immediate post-dominator of 5. The edges
(idom(v), v) form a directed tree rooted at entry for pre-dominators and at exit
for post-dominators. Figures 7.9 and 7.10 show the pre- and post-dominator trees
of the flowgraph in Figure 7.8.
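The pre-dominator sets can be computed with a simple iterative scheme, shown
below as a sketch. It is a textbook data-flow formulation, not one of the
faster algorithms cited in this section, and it rests on several assumptions
made for illustration: vertices are numbered from 0 with vertex 0 as entry,
there are at most 32 vertices so that a set fits in one bitmask, and every
vertex is reachable from entry. Unlike Pre(v) as defined above, the computed
set dom[v] includes v itself.

```c
#include <stdint.h>

/* pred[v] is a bitmask of the predecessors of vertex v; vertex 0 is entry.
   On return, dom[v] is a bitmask of all vertices that pre-dominate v,
   including v itself. Requires 1 <= n <= 32. */
void pre_dominators(const uint32_t *pred, int n, uint32_t *dom) {
    uint32_t all = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    dom[0] = 1u;                           /* entry is dominated only by itself */
    for (int v = 1; v < n; v++) {
        dom[v] = all;                      /* start from the full set */
    }
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 1; v < n; v++) {
            uint32_t meet = all;           /* intersection over predecessors */
            for (int p = 0; p < n; p++) {
                if (pred[v] & (1u << p)) {
                    meet &= dom[p];
                }
            }
            uint32_t next = meet | (1u << v);
            if (next != dom[v]) {
                dom[v] = next;
                changed = 1;
            }
        }
    }
}
```

Post-dominators can be obtained with the same routine by passing successor
masks instead of predecessor masks and renumbering so that exit is vertex 0.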

The problem of finding dominators was first considered in the late 1960s by
Lowry and Medlock [Lowry and Medlock, 1969]. They presented an $O(|V|^4)$
algorithm for finding all immediate dominators in a flowgraph. Successive
improvements of this algorithm were made by Aho and Ullman [Aho and Ullman,
1972], Purdom and Moore [Purdom and Moore, 1972], Tarjan [Tarjan, 1974],
and Lengauer and Tarjan [Lengauer and Tarjan, 1979]. Lengauer and Tarjan’s
algorithm [Lengauer and Tarjan, 1979] is a nearly-linear algorithm with
the complexity $O(|E| \alpha(|E|, |V|))$, where $\alpha$ is the standard functional
inverse of the Ackermann function. Linear algorithms for finding dominators were
presented by Harel [Harel, 1985], Alstrup et al. [Alstrup et al., 1999], and
Buchsbaum et al. [Buchsbaum et al., 1998].

Figure 7.8. Flowgraph of the program in Figure 7.7.

4.3 Statement Coverage Using Kernels 

This section presents a technique [Dubrova, 2005] for finding a subset of
the program’s flowgraph vertices, called kernel, with the property that any set
of tests which executes all vertices of the kernel executes all vertices of the
flowgraph. A 100% statement coverage can be achieved by constructing a set
of tests for the kernel.

Figure 7.9. Pre-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves of the
tree in Figure 7.10.


Definition 7.1 A vertex $v \in V$ of the flowgraph is covered by a test case t
if the basic block of the program representing v is reached at least once during
the execution of t.

The following Lemma is the basic property of the presented technique [Agrawal, 
1994]. 

Lemma 1 If a test case t covers $u \in V$, then it covers any post-dominator of u
as well:

$(t \text{ covers } u) \wedge (v \in Post(u)) \Rightarrow (t \text{ covers } v)$.

Proof: If v post-dominates u, then every path from u to exit contains v.
Therefore, if u is reached at least once during the execution of t, then v is
reached, too.

Definition 7.2 A kernel K of a flowgraph G is a subset of vertices of G 
which satisfies the property that any set of tests which executes all vertices of 
the kernel executes all vertices of G. 

Definition 7.3 A minimum kernel is a kernel of the smallest size. 

Let $L_{post}$ ($L_{pre}$) denote the set of leaf vertices of the post-(pre-)dominator
tree of G. The set $L^{D}_{post} \subseteq L_{post}$ contains vertices of $L_{post}$ which
pre-dominate some vertex of $L_{post}$:

$L^{D}_{post} = \{ v \mid (v \in L_{post}) \wedge (v \in Pre(u) \text{ for some } u \in L_{post}) \}$.

Similarly, the subset $L^{D}_{pre} \subseteq L_{pre}$ contains all vertices of $L_{pre}$
which post-dominate some vertex of $L_{pre}$:

$L^{D}_{pre} = \{ v \mid (v \in L_{pre}) \wedge (v \in Post(u) \text{ for some } u \in L_{pre}) \}$.

Assume that the program execution terminates normally on all test cases 
supplied. Then the following statement holds. 

Theorem 7.1 The set $L_{post} - L^D_{post}$ is a minimum kernel. 

Proof: Lemma 1 shows that, if a vertex of a flowgraph is covered by a test 
case t, then all its post-dominators are also covered by t. Therefore, in order to 
cover all vertices of a flowgraph, it is sufficient to cover all leaves $L_{post}$ in its 
post-dominator tree, i.e. $L_{post}$ is a kernel. 

$L^D_{post}$ contains all vertices of $L_{post}$ which pre-dominate some vertex of $L_{post}$. 
If v is a pre-dominator of u, and u is covered by t, then v is also covered by t, 
since every path from entry to u contains v as well. Thus, any set of tests which 
covers $L_{post} - L^D_{post}$ covers $L_{post}$ as well. Since $L_{post}$ is a kernel, $L_{post} - L^D_{post}$ 
is a kernel, too. 

Next, we prove that the set $L_{post} - L^D_{post}$ is a minimum kernel. Suppose that 
there exists another kernel, K', such that $|K'| < |L_{post} - L^D_{post}|$. If $v \in K'$ and 
$v \notin L_{post}$, then $v \in Post(u)$ for some $u \in L_{post}$. Since every path from u to exit 
contains v, if u is reached at least once during the execution of some test case, 
then v is reached, too. Therefore, K' remains a kernel if we replace v by u. 

Suppose we replaced all $v \in K'$ such that $v \notin L_{post}$ by $u \in L_{post}$ such that 
$v \in Post(u)$. Now, $K' \subseteq L_{post}$. If there exists $w \in L_{post} - K'$ such that, for all 
$u \in K'$, $w \notin Pre(u)$, then there exists at least one path from entry to each u which 
does not contain w. This means that there exists a test set, formed by the set 
of paths path(u), where path(u) is the path to u which does not contain w, that 
covers K' but not w. According to Definition 7.2, this implies that K' is not a 
kernel. Therefore, to guarantee that K' is a kernel, w must be a pre-dominator 
of some $u \in K'$, for all $w \in L_{post} - K'$. This implies that $|K'| = |L_{post} - L^D_{post}|$. 

The next theorem shows that the set $L_{pre} - L^D_{pre}$ is also a minimum kernel. 

Theorem 7.2 $|L_{post} - L^D_{post}| = |L_{pre} - L^D_{pre}|$. 

The proof is done by showing that the proof of minimality of Theorem 7.1 
can be carried out starting from $L_{pre}$. 

Figure 7.10. Post-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves of 
the tree in Figure 7.9. 


4.4 Computing Minimum Kernels 

This section presents a linear-time algorithm for computing minimum kernels 
from [Dubrova, 2005]. The pseudo-code is shown in Figure 7.11. 

First, pre- and post-dominator trees of the flowgraph G = (V, E, entry, exit), 
denoted by $T_{pre}$ and $T_{post}$, are computed. Then, the numbers of leaves of 
the trees, $|L_{pre}|$ and $|L_{post}|$, are compared. According to Theorems 7.1 and 7.2, 
both $L_{post} - L^D_{post}$ and $L_{pre} - L^D_{pre}$ represent minimum kernels. The procedure 
FindLD is applied to the smaller of the sets $L_{pre}$ and $L_{post}$. 

FindLD checks whether the leaves $L_{pre}$ of the tree $T_{pre}$ are dominated by 
some vertex of $L_{pre}$ in another tree, $T_{post}$, or vice versa. In other words, FindLD 
computes the set $L^D_{pre}$ ($L^D_{post}$). 

Theorem 7.3 The algorithm Kernel computes a minimum kernel of a 
flowgraph G = (V, E, entry, exit) in $O(|V| + |E|)$ time. 

Proof: The correctness of the algorithm follows directly from Theorems 7.1 
and 7.2. The complexity of Kernel is determined by the complexity 
of computing the $T_{pre}$ and $T_{post}$ trees. A dominator tree can be computed 
in $O(|V| + |E|)$ time [Alstrup et al., 1999]. Thus, the overall complexity is 
$O(|V| + |E|)$. 

    algorithm Kernel(V, E, entry, exit); 
        T_pre = PreDominatorTree(V, E, entry); 
        L_pre = set of leaves of T_pre; 
        T_post = PostDominatorTree(V, E, exit); 
        L_post = set of leaves of T_post; 
        if |L_pre| < |L_post| then 
            K = L_pre - FindLD(L_pre, T_post); 
        else 
            K = L_post - FindLD(L_post, T_pre); 
        return K; 
    end 

Figure 7.11. Pseudo-code of the algorithm for computing minimum kernels. 

As an example, let us compute a minimum kernel for the flowgraph in 
Figure 7.8. Its pre- and post-dominator trees are shown in Figures 7.9 and 7.10. 
$T_{pre}$ has 7 leaves, $L_{pre} = \{7, 8, 9, 12, 14, 15, 16\}$, and $T_{post}$ has 9 leaves, 
$L_{post} = \{1, 3, 4, 6, 7, 8, 12, 13, 14\}$. So, we check which of the leaves $L_{pre}$ dominate 
at least one other leaf of $L_{pre}$ in $T_{post}$. Leaves $L_{pre}$ are marked as shaded circles 
in $T_{post}$ in Figure 7.10. We can see that, in $T_{post}$, vertex 9 dominates 7 and 
8, vertex 15 dominates 12 and 14, and vertex 16 dominates all other leaves of 
$L_{pre}$. Thus, $L^D_{pre} = \{9, 15, 16\}$. The minimum kernel $L_{pre} - L^D_{pre}$ consists of four 
vertices: 7, 8, 12 and 14. 

For a comparison, let us compute the minimum kernel given by $L_{post} - L^D_{post}$. 
The $L_{post}$ leaves are marked as shaded circles in $T_{pre}$ in Figure 7.9. We can see 
that, in $T_{pre}$, vertex 4 dominates 6, 7 and 8, vertex 6 dominates 7 and 8, vertex 
13 dominates 14, vertex 3 dominates all leaves of $L_{post}$ except 1, and vertex 
1 dominates all leaves of $L_{post}$. Thus, $L^D_{post} = \{1, 3, 4, 6, 13\}$. The minimum 
kernel $L_{post} - L^D_{post}$ consists of four vertices: 7, 8, 12 and 14. In this example, the 
kernels $L_{pre} - L^D_{pre}$ and $L_{post} - L^D_{post}$ are the same, but it is not always the case. 
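
The steps of the Kernel algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [Dubrova, 2005]: it swaps in a simple iterative data-flow computation of dominator sets for the linear-time dominator-tree algorithms cited in the text, so it does not achieve the $O(|V| + |E|)$ bound. It also works on dominator sets rather than explicit trees, treats domination in FindLD as strict, and assumes every vertex is reachable from entry and can reach exit. The flowgraph `SUCC` is hypothetical (the edges of Figure 7.8 are not reproduced here), and all function names are illustrative.

```python
def reverse(succ):
    """Flowgraph with every edge reversed (used for post-dominators)."""
    rev = {v: [] for v in succ}
    for v, targets in succ.items():
        for w in targets:
            rev[w].append(v)
    return rev


def dominator_sets(succ, root):
    """dom[v] = vertices on every path from root to v (iterative, not linear)."""
    pred = reverse(succ)
    vertices = set(succ)
    dom = {v: set(vertices) for v in vertices}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for v in vertices - {root}:
            new = set(vertices)
            for p in pred[v]:
                new &= dom[p]
            new.add(v)
            if new != dom[v]:
                dom[v], changed = new, True
    return dom


def tree_leaves(dom):
    """Leaves of the dominator tree: vertices that strictly dominate nothing."""
    internal = set()
    for v, ds in dom.items():
        internal |= ds - {v}
    return set(dom) - internal


def find_ld(leaf_set, other_dom):
    """Leaves that strictly dominate another leaf in the other relation."""
    result = set()
    for u in leaf_set:
        result |= (other_dom[u] - {u}) & leaf_set
    return result


def kernel(succ, entry, exit_):
    """Minimum kernel, following the structure of the pseudo-code."""
    pre = dominator_sets(succ, entry)
    post = dominator_sets(reverse(succ), exit_)
    l_pre, l_post = tree_leaves(pre), tree_leaves(post)
    if len(l_pre) < len(l_post):
        return l_pre - find_ld(l_pre, post)
    return l_post - find_ld(l_post, pre)


# Hypothetical flowgraph with a loop: 1 is entry, 6 is exit.
SUCC = {1: [2], 2: [3, 4], 3: [5], 4: [5], 5: [2, 6], 6: []}
KERNEL = kernel(SUCC, 1, 6)
```

For this flowgraph the kernel consists of the two branch vertices 3 and 4: any terminating test set that reaches both of them necessarily reaches vertices 1, 2, 5 and 6 as well.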

4.5 Decision Coverage Using Kernels 

The kernel-based technique described above can be similarly applied to 
branch coverage by constructing pre- and post-dominator trees for the edges of 
the flowgraph instead of for its vertices. Figures 7.12 and 7.13 show the edge pre- 
and post-dominator trees of the flowgraph in Figure 7.8. 

Similarly to Definition 7.2, a kernel set for edges is defined as a subset 
of edges of the flowgraph which satisfies the property that any set of tests 
which executes all edges of the kernel executes all edges of the flowgraph. 
A 100% branch coverage can be achieved by constructing a set of tests for 
the kernel. Minimum kernels for Figures 7.12 and 7.13 are: 
$L_{pre} - L^D_{pre} = \{i, h, k, m, p, q, t, y\}$ and $L_{post} - L^D_{post} = \{f, g, k, r, q, s, m, y\}$. 
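
The edge-kernel definition can be checked by brute force on a small example. The sketch assumes an acyclic flowgraph, so that every possible terminating test corresponds to one of finitely many entry-to-exit paths; it is a direct test of the definition, not the dominator-based algorithm. A candidate set fails to be a kernel exactly when, for some edge, the paths avoiding that edge still cover the whole candidate. The edge labels in `EDGES` and the function names are hypothetical and unrelated to Figures 7.12 and 7.13.

```python
# Hypothetical acyclic flowgraph: edge label -> (source, destination).
EDGES = {
    "a": ("entry", "x"),
    "b": ("x", "y"),
    "c": ("x", "z"),
    "d": ("y", "w"),
    "e": ("z", "w"),
    "f": ("w", "exit"),
}


def all_paths(edges, node="entry"):
    """Every entry-to-exit path as a list of edge labels (acyclic graphs only)."""
    if node == "exit":
        return [[]]
    return [
        [label] + rest
        for label, (src, dst) in edges.items()
        if src == node
        for rest in all_paths(edges, dst)
    ]


def is_edge_kernel(edges, candidate):
    """True if every test set covering `candidate` covers all edges."""
    paths = all_paths(edges)
    for missed in edges:
        # Largest test set that never executes the edge `missed`.
        avoiding = [p for p in paths if missed not in p]
        covered = set().union(*avoiding)
        if candidate <= covered:
            return False  # candidate is covered although `missed` is not
    return True
```

In this example both `{"b", "c"}` and `{"d", "e"}` pass the check while no single edge does, which also shows that two minimum kernels of equal size need not contain the same edges.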

Figure 7.12. Edge pre-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves 
of the tree in Figure 7.13. 


5. Problems 

7.1. A program consists of 10 independent routines, all of them being used in the 
normal operation of the program. The probability that a routine is faulty is 
0.10 for each of the routines. It is intended to use 3-version programming, 
with voting to be conducted after the execution of each routine. The 
effectiveness of the voting in eliminating faults is 0.85 when one of the three 
routines is faulty and 0 when more than one routine is faulty. What is the 
probability of a fault-free program: 

(a) When only a single version is produced and no routine testing is 
conducted. 

(b) When only a single version of each routine is used, but extensive routine 
testing is conducted that reduces the fault content to 10% of the original 
level. 

(c) When three-version programming is used. 

Figure 7.13. Edge post-dominator tree of the flowgraph in Figure 7.8; shaded vertices are leaves 
of the tree in Figure 7.12. 